Front/back separation and first benchmarks

Daniel Phillips phillips at phunq.net
Fri Dec 28 01:09:43 PST 2012


Two weeks ago, Hirofumi completed the kernel implementation of the last major 
design element needed for realistic performance measurements, while operating
with the atomicity and durability guarantees expected of a modern filesystem. I
am happy to be able to say that early results indicate that Tux3 now shows
performance competitive with other Linux filesystems and may have taken the
lead in some respects. Here are some details of this recent work.

Front/back separation in Tux3 decouples front end filesystem updates (Posix 
syscalls) from back end atomic delta transfer to media so that the user only 
observes cache transfer latency, not the overhead of preparing dirty cache for 
transfer to media, which is done in an asynchronous background task instead. 
This relatively small change improves performance significantly, bringing Tux3
near the performance of Tmpfs, a pure cache filesystem. In practice, this makes
Tux3 somewhat faster than Ext4 and quite a lot faster than Btrfs on the
benchmarks we have run so far.

In spirit, front/back separation resembles the "delayed allocation" employed 
by Ext4 to improve write performance significantly. However, Tux3 does not
limit this to disk allocations; instead, it delays every kind of filesystem
change: create, delete and rename directory operations, inode attribute
changes, truncates, and in short, anything that affects disk. These operations
are not just delayed, but hived off into an entirely separate task context
where they do not affect foreground task latency.
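
To make this concrete, here is a tiny userspace sketch of the front/back
split. None of this is actual Tux3 code and every name in it is invented for
illustration: the front end simply records each logical change under a lock
and returns at once, while a separate back end thread detaches whatever has
accumulated and commits it as one batch, so the front end never waits on
media.

/*
 * Conceptual sketch only, with invented names: a front end that queues
 * logical updates into an in-memory batch and returns, and a back end
 * thread that periodically detaches the batch and commits it.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct update {
    char desc[64];                   /* e.g. "create dir/a" */
    struct update *next;
};

static struct update *pending;       /* updates queued for the next commit */
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* Front end: record the change in cache and return, never touching media */
static void frontend_update(const char *desc)
{
    struct update *u = malloc(sizeof *u);
    snprintf(u->desc, sizeof u->desc, "%s", desc);
    pthread_mutex_lock(&pending_lock);
    u->next = pending;
    pending = u;
    pthread_mutex_unlock(&pending_lock);
}

/* Back end: detach whatever has accumulated and commit it as one unit */
static void *backend(void *unused)
{
    for (int round = 0; round < 3; round++) {
        sleep(1);
        pthread_mutex_lock(&pending_lock);
        struct update *batch = pending;  /* front end continues unhindered */
        pending = NULL;
        pthread_mutex_unlock(&pending_lock);
        while (batch) {
            printf("commit: %s\n", batch->desc);
            struct update *next = batch->next;
            free(batch);
            batch = next;
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t back;
    pthread_create(&back, NULL, backend, NULL);
    frontend_update("create dir/a");
    frontend_update("rename dir/a -> dir/b");
    frontend_update("truncate dir/b");
    pthread_join(back, NULL);
    return 0;
}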

Implementing front/back separation efficiently, without major stalls where the 
front end waits on the back end, was challenging, and credit for this great 
work all goes to Hirofumi, who has checked in a stunning amount of beautiful, 
reliable and highly performant code over the past few months. One of the 
problems we needed to address was, what happens if a dirty data page is being 
written out as part of a "delta" atomic update and a user task wants to 
rewrite that same page? Should the user task wait until the page has been
fully written to media, to avoid polluting the earlier delta with more recent 
changes? That would stall the front end transaction, perhaps for several 
milliseconds, which is not very nice. Instead, we "fork" the dirty page, 
creating a new copy in cache that the front end can modify without affecting an 
earlier delta, or worse, changing the page contents halfway through a DMA 
transfer. This is done with lightweight or no locking to keep front end stalls 
as few and brief as possible. Analysis with perf shows that our front end is 
indeed very well decoupled from the back end, and this shows up as excellent 
benchmark numbers.
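
Here is the shape of that idea as another purely illustrative userspace
sketch, with invented names rather than the kernel implementation: if the
cached page currently belongs to an in-flight delta, the front end copies it
and points the cache at the copy, so the original stays stable for the
duration of its write-out.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page {
    bool in_flight;                  /* being written out by the back end? */
    uint8_t data[PAGE_SIZE];
};

/* Return a page the front end may safely modify right now */
static struct page *page_fork(struct page **slot)
{
    struct page *page = *slot;
    if (!page->in_flight)
        return page;                 /* no delta owns it, modify in place */
    struct page *copy = malloc(sizeof *copy);
    memcpy(copy->data, page->data, PAGE_SIZE);
    copy->in_flight = false;
    *slot = copy;                    /* the cache now points at the copy */
    return copy;                     /* the old page finishes its I/O undisturbed */
}

int main(void)
{
    struct page *original = calloc(1, sizeof *original);
    strcpy((char *)original->data, "old contents");
    original->in_flight = true;      /* pretend a delta is writing it out */

    struct page *cache_slot = original;
    struct page *writable = page_fork(&cache_slot);
    strcpy((char *)writable->data, "new contents");

    printf("delta still sees:   %s\n", (char *)original->data);
    printf("front end now sees: %s\n", (char *)cache_slot->data);
    free(writable);
    free(original);
    return 0;
}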

Dirty page forking is a key technique for Tux3, and recent work by other 
kernel developers suggests that similar techniques are likely to become
pervasive throughout the kernel. See the "stable page" work that has been 
going on for the last few months. But Tux3 uses its fork technique to improve 
performance, whereas so far the ongoing stable page work has slowed things 
down. If Tux3 had nothing else to offer, an effective implementation of forking 
would already be enough. But that is just one of several significant 
innovations in Tux3 that are likely to change the way Linux filesystems are 
designed and built. I will touch on some examples in upcoming posts.

With page forking, Tux3 implements stronger data consistency semantics than 
have so far been seen on Linux, even stronger than Ext4's "data=journal" mode, 
which has performance issues. To be clear about this point, when we compare 
Tux3 performance to Ext4 performance, Tux3 is actually doing more than Ext4 
because we guarantee that files will always be committed to disk with their 
correct data and correct directory entries, no matter what kind of bad luck 
you may have with sudden interruptions, and no matter whether you remember to 
do fsync or fdatasync in all the right places. And we provide guarantees about 
the order in which updates arrive on disk that may avoid the need for 
performance-harming fsync operations in common situations. And we do not leave 
data sitting in dirty cache for an unpredictable period. Instead, we commit 
all dirty cache data to disk at each delta, and in a predictable order. You 
would think that all the extra care we take with data consistency would slow 
Tux3 down, but the opposite now appears to be the case.
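
As an example of the kind of fsync that ordering may make unnecessary,
consider the familiar way an application replaces a file atomically: write a
temporary file, fsync it, then rename it over the target. With the ordering
guarantee described above, the fsync is the step that could be dropped in
common cases. Here is a small application-side sketch, with that step kept
behind a flag to show exactly what changes:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace 'target' with new contents via the classic temp-and-rename pattern */
static int replace_file(const char *target, const char *buf, size_t len,
                        int need_fsync)
{
    char tmp[256];
    snprintf(tmp, sizeof tmp, "%s.tmp", target);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len)
        goto fail;
    /* On an ordered filesystem the rename below cannot reach disk before
     * the data it names, so this flush becomes optional; elsewhere it is
     * what keeps a crash from leaving an empty or truncated file. */
    if (need_fsync && fsync(fd) != 0)
        goto fail;
    close(fd);
    return rename(tmp, target);
fail:
    close(fd);
    unlink(tmp);
    return -1;
}

int main(void)
{
    const char msg[] = "hello\n";
    return replace_file("greeting.txt", msg, sizeof msg - 1, 1);
}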

Our front/back separation is still not quite perfect. The front end still 
stalls on some back end locks from time to time. We have a pretty good idea 
what to do about it and it is on the list of things to do. But at this point, 
it works well enough that other work has become higher priority.

The way Tux3 transfers log blocks to disk was also improved, gaining a little
more speed and widening what appears to be a performance lead for Tux3
in the cases we exercise with the Fsx and Fsstress filesystem stress scripts.

There are cases where Tux3 is still not the fastest Linux filesystem. For 
example, our disk layout algorithm is too simplistic to avoid read 
fragmentation under many common loads. And there is scalability work to do, 
particularly in the areas of directory indexing and free resource management. 
However, we are not aware of any cases where Tux3 is limited by its design to 
being less than competitive as a general purpose filesystem on Linux.

Now here is the bad news: for today I will not post any actual numbers, 
because some work still remains to prepare these properly. Coming soon.

Regards,

Daniel

