Front/back separation and first benchmarks
Daniel Phillips
phillips at phunq.net
Fri Dec 28 01:09:43 PST 2012
Two weeks ago, Hirofumi completed the kernel implementation of the last major
design element needed for realistic performance measurements, while operating
with the atomicity and durability guarantees expected of a modern filesystem. I am
happy to be able to say that early results indicate that Tux3 now shows
performance competitive with other Linux filesystems and may possibly have
taken the lead in some respects. Here are some details of this recent work.
Front/back separation in Tux3 decouples front end filesystem updates (Posix
syscalls) from back end atomic delta transfer to media so that the user only
observes cache transfer latency, not the overhead of preparing dirty cache for
transfer to media, which is done in an asynchronous background task instead.
This relatively small change improves performance significantly, bringing Tux3
near the performance of Tmpfs, a pure cache filesystem. In particular, this
makes Tux3 somewhat faster than Ext4 and quite a lot faster than Btrfs for the
particular benchmarks we have run so far.
In spirit, front/back separation resembles the "delayed allocation" employed
by Ext4 to improve write performance significantly. However, Tux3 does not
limit this to disk allocations; instead, it delays every kind of
filesystem change: create, delete and rename directory operations, inode
attribute changes, truncates, and in short, anything that affects disk. And
these operations are not just delayed, but hived off into an entirely separate
task context where they do not affect foreground task latency.
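To make the idea concrete, here is a minimal userspace sketch of the front/back split. The names (change_queue, frontend_update, backend_commit_delta) are illustrative only, not actual Tux3 identifiers, and the back end is called synchronously here where the real design runs it as a separate kernel task:

```c
#include <assert.h>
#include <stdio.h>

/* Sketch only: illustrative names, not actual Tux3 code. */

#define MAX_CHANGES 64

struct change {
    char desc[32];
};

struct change_queue {
    struct change pending[MAX_CHANGES];
    int count;      /* dirty changes waiting in cache */
    int committed;  /* changes already transferred to media */
};

/* Front end: record the change in cache and return immediately.
 * The caller sees only cache latency, no media I/O. */
static void frontend_update(struct change_queue *q, const char *desc)
{
    assert(q->count < MAX_CHANGES);
    snprintf(q->pending[q->count].desc,
             sizeof q->pending[q->count].desc, "%s", desc);
    q->count++;
}

/* Back end: in the real design this runs in its own task context;
 * it takes everything accumulated so far and commits it to media
 * as one atomic delta. */
static void backend_commit_delta(struct change_queue *q)
{
    q->committed += q->count;  /* stand-in for atomic transfer to media */
    q->count = 0;
}
```

The point of the split is visible even in this toy: frontend_update touches only cache state, so its cost is independent of how expensive the eventual media transfer is.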
Implementing front/back separation efficiently, without major stalls where the
front end waits on the back end, was challenging, and credit for this great
work all goes to Hirofumi, who has checked in a stunning amount of beautiful,
reliable and highly performant code over the past few months. One of the
problems we needed to address was, what happens if a dirty data page is being
written out as part of a "delta" atomic update and a user task wants to
rewrite that same page? Should the user task wait until the page has been
fully written to media, to avoid polluting the earlier delta with more recent
changes? That would stall the front end transaction, perhaps for several
milliseconds, which is not very nice. Instead, we "fork" the dirty page,
creating a new copy in cache that the front end can modify without affecting an
earlier delta, or worse, changing the page contents halfway through a DMA
transfer. This is done with lightweight or no locking to keep front end stalls
as few and brief as possible. Analysis with perf shows that our front end is
indeed very well decoupled from the back end, and this shows up as excellent
benchmark numbers.
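The forking decision can be sketched in a few lines of userspace C. This is only an illustration of the idea, assuming a simplified page struct; the name page_fork and the under_io flag stand in for the real kernel machinery:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: illustrative struct and names, not the kernel code. */

#define PAGE_BYTES 4096

struct page {
    unsigned char data[PAGE_BYTES];
    int under_io;   /* nonzero: page belongs to an in-flight delta */
};

/* Before the front end rewrites a dirty page, check whether the back
 * end is writing it out. If so, clone the page and let the front end
 * modify the clone: the in-flight delta (and any DMA reading the old
 * copy) sees stable contents, and nobody waits for I/O completion. */
static struct page *page_fork(struct page *pg)
{
    struct page *copy;

    if (!pg->under_io)
        return pg;              /* not in a delta: modify in place */

    copy = malloc(sizeof *copy);
    assert(copy);
    memcpy(copy->data, pg->data, PAGE_BYTES);
    copy->under_io = 0;         /* the clone joins the next delta */
    return copy;
}
```

Either way the front end gets a page it may scribble on immediately, which is why the stall simply disappears instead of being merely shortened.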
Dirty page forking is a key technique for Tux3, and recent work by other
kernel developers suggests that similar techniques are likely to become
pervasive throughout the kernel. See the "stable page" work that has been
going on for the last few months. But Tux3 uses its fork technique to improve
performance, whereas so far the ongoing stable page work has slowed things
down. If Tux3 had nothing else to offer, an effective implementation of forking
would already be enough. But that is just one of several significant
innovations in Tux3 that are likely to change the way Linux filesystems are
designed and built. I will touch on some examples in upcoming posts.
With page forking, Tux3 implements stronger data consistency semantics than
have so far been seen on Linux, even stronger than Ext4's "data=journal" mode,
which has performance issues. To be clear about this point, when we compare
Tux3 performance to Ext4 performance, Tux3 is actually doing more than Ext4
because we guarantee that files will always be committed to disk with their
correct data and correct directory entries, no matter what kind of bad luck
you may have with sudden interruptions, and no matter whether you remember to
do fsync or fdatasync in all the right places. And we provide guarantees about
the order in which updates arrive on disk that may avoid the need for
performance-harming fsync operations in common situations. And we do not leave
data sitting in dirty cache for an unpredictable period. Instead, we commit
all dirty cache data to disk at each delta, and in a predictable order. You
would think that all the extra care we take with data consistency would slow
Tux3 down, but the opposite would now appear to be the case.
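To illustrate the consistency guarantee described above, here is a toy model of atomic deltas, with illustrative names only: a file's data and the directory entry naming it land in the same delta, so a crash leaves media at some whole-delta boundary, never a mix of old and new state:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch only: a toy media log of atomic deltas. */

#define MAX_DELTAS 8

struct delta {
    int seq;        /* deltas commit in strictly increasing order */
    char name[16];  /* directory entry created in this delta */
    char data[16];  /* file contents written in the same delta */
};

struct media {
    struct delta log[MAX_DELTAS];
    int ndeltas;
};

/* Commit a name and its data together; neither reaches media alone. */
static void commit_delta(struct media *m, const char *name, const char *data)
{
    struct delta *d = &m->log[m->ndeltas];
    d->seq = m->ndeltas + 1;
    snprintf(d->name, sizeof d->name, "%s", name);
    snprintf(d->data, sizeof d->data, "%s", data);
    m->ndeltas++;
}

/* A crash leaves the first `survived` deltas on media. Every surviving
 * file has both its directory entry and its data, in commit order. */
static int consistent(const struct media *m, int survived)
{
    for (int i = 0; i < survived; i++)
        if (m->log[i].seq != i + 1 ||
            m->log[i].name[0] == '\0' || m->log[i].data[0] == '\0')
            return 0;
    return 1;
}
```

Because every crash point falls on a delta boundary, the consistency check holds for any prefix of the log, which is the whole of the guarantee: no fsync needed to keep a name and its data together.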
Our front/back separation is still not quite perfect. The front end still
stalls on some back end locks from time to time. We have a pretty good idea
how to fix this, and it is on our list. But at this point,
it works well enough that other work has become higher priority.
The way Tux3 transfers log blocks to disk was also improved, gaining a little
more performance and widening what appears to be a performance lead for Tux3
in the cases we exercise with the Fsx and Fsstress filesystem stress scripts.
There are cases where Tux3 is still not the fastest Linux filesystem. For
example, our disk layout algorithm is too simplistic to avoid read
fragmentation under many common loads. And there is scalability work to do,
particularly in the areas of directory indexing and free resource management.
However, we are not aware of any cases where Tux3 is limited by its design to
being less than competitive as a general purpose filesystem on Linux.
Now here is the bad news: for today I will not post any actual numbers,
because some work still remains to prepare these properly. Coming soon.
Regards,
Daniel