Tux3 Report: How fast can we fsync?
Daniel Phillips
daniel at phunq.net
Tue Apr 28 16:13:18 PDT 2015
Greetings,
This post is dedicated to Ted, who raised doubts a while back about
whether Tux3 can ever have a fast fsync:
https://lkml.org/lkml/2013/5/11/128
"Re: Tux3 Report: Faster than tmpfs, what?"
Ted suggested that Tux3's inherently asynchronous design would be a
limitation when it comes to synchronous operations like fsync. As he
put it, "any advantage of decoupling the front/back end is nullified"
because of temporal coupling. We found the opposite to be true: our
asynchronous design works as well for synchronous operations as it does
for any other, and Tux3 now has a high performance fsync to prove it.
Until now, Tux3 handled fsync as a full delta commit equivalent to
syncfs, just like Ext3. This is semantically correct but sometimes
commits more data than necessary and creates a bottleneck by serializing
all fsyncs. To optimize it, we added a mechanism for committing any
subset of dirty inodes separately from a full delta.
Like our normal delta commit, the design is asynchronous: front end
tasks produce fsync work and a backend task commits it. We now have two
backends, one to commit fsyncs and another to commit deltas, serialized
so the combination is single threaded and mostly lockless. Each fsync
moves an inode's dirty blocks from the front delta to the back delta,
then queues the inode for the backend. The backend spins away committing
fsyncs, opportunistically batching them up into group commits. The fsync
backend gives priority to the delta backend: whenever a full delta flush
is needed because of cache pressure or cache age, it finishes its fsync
work and gets out of the way.
This approach required only minimal changes to our existing delta commit
mechanism, mainly to support crash consistency. In particular, our delta
model did not need to be changed at all, and block forking protects
fsync commits in the same way as normal delta commits. That is, an inode
can be freely updated (even via mmap) while it is being fsynced. The
fsync data is isolated from those changes and the frontend does not
stall.
I measured fsync performance using a 7200 RPM disk as a virtual drive
under KVM, configured with cache=none so that asynchronous writes are
cached and synchronous writes translate into direct writes to the
block device. To focus purely on fsync, I wrote a small utility (at the
end of this post) that forks a number of tasks, each of which
continuously appends to and fsyncs its own file. For a single task doing
1,000 fsyncs of 1K each, we have:
Ext4: 34.34s
XFS: 23.63s
Btrfs: 34.84s
Tux3: 17.24s
Tux3 has a nice advantage for isolated fsyncs. This is mainly due to
writing a small number of blocks per commit, currently just five blocks
for a small file. As for a normal delta commit, we do not update bitmap
blocks or index nodes, but log logical changes instead, flushing out
the primary metadata with occasional "unify" deltas to control the log
size and keep replay fast. The 1,000 fsync test typically does about
three unifies, so we do in fact update all our metadata and pay that
cost, just not on every commit.
The win is bigger than it appears at first glance, because writing the
commit block takes about 16.5 ms. That would be similar for all the
filesystems tested, so factoring that out, we see that Tux3 must be
doing 10 times less work than XFS, 24 times less than Ext4 and 25 times
less than Btrfs. Over time, that gap should widen as we reduce our small
file commit overhead towards just two blocks.
In this single threaded test, we pay the full price for "communication
delays" with our asynchronous backend, which evidently does not amount
to much. Given that a context switch is on the order of microseconds
while writing the commit block is on the order of milliseconds, it is
unsurprising that two extra context switches just disappear in the
noise.
Things get more interesting with parallel fsyncs. In this test, each
task does ten fsyncs and task count scales from ten to ten thousand. We
see that all tested filesystems are able to combine fsyncs into group
commits, with varying degrees of success:
Tasks: 10 100 1,000 10,000
Ext4: 0.79s 0.98s 4.62s 61.45s
XFS: 0.75s 1.68s 20.97s 238.23s
Btrfs 0.53s 0.78s 3.80s 84.34s
Tux3: 0.27s 0.34s 1.00s 6.86s
(lower is better)
We expect sub-linear scaling with respect to tasks as opportunities to
combine commits increase, then linear scaling as total write traffic
begins to dominate. Tux3 still shows sub-linear scaling even at 10,000
tasks. XFS scales poorly, and also suffers from read starvation at the
high end, sometimes taking tens of seconds to cat a file or minutes to
list a directory. Ext4 and Tux3 exhibit no such issues, remaining
completely responsive under all tested loads. The bottom line for this
test is, Tux3 is twice as fast at fsync with a modest task count, and
the gap widens to nine times faster than its nearest competitor as task
count increases.
Is there any practical use for fast parallel fsync of tens of thousands
of tasks? This could be useful for a scalable transaction server that
sits directly on the filesystem instead of a database, as is the
fashion for big data these days. It certainly can't hurt to know that
if you need that kind of scaling, Tux3 will do it.
Of course, a pure fsync load could be viewed as somewhat unnatural. We
also need to know what happens under a realistic load with buffered
operations mixed with fsyncs. We turn to an old friend, dbench:
Dbench -t10
Tasks: 8 16 32
Ext4: 35.32 MB/s 34.08 MB/s 39.71 MB/s
XFS: 32.12 MB/s 25.08 MB/s 30.12 MB/s
Btrfs: 54.40 MB/s 75.09 MB/s 102.81 MB/s
Tux3: 85.82 MB/s 133.69 MB/s 159.78 MB/s
(higher is better)
Tux3 and Btrfs scale well and are way ahead of Ext4 and XFS, while Ext4
and XFS scale poorly or even negatively. Tux3 is the leader by a wide
margin, beating XFS by more than a factor of 5 at the high end.
Dbench -t10 -s (all file operations synchronous)
Tasks: 8 16 32
Ext4: 4.51 MB/s 6.25 MB/s 7.72 MB/s
XFS: 4.24 MB/s 4.77 MB/s 5.15 MB/s
Btrfs: 7.98 MB/s 13.87 MB/s 22.87 MB/s
Tux3: 15.41 MB/s 25.56 MB/s 39.15 MB/s
(higher is better)
With a pure synchronous load (O_SYNC) the ranking is not changed but the
gaps widen, and Tux3 outperforms XFS by a factor of 7.5.
At the risk of overgeneralizing, a trend seems to be emerging: the new,
write-anywhere designs run synchronous operations faster, combine
synchronous and asynchronous operations more efficiently, and scale
better to high task counts than the traditional journaling designs. If
there is to be an Ext5, it might be worth considering the merit of
abandoning the journal in favor of something along the lines of Tux3's
redirect-on-write and logging combination.
Getting back to Ted's question, perhaps an asynchronous design really is
a better idea all round, even for synchronous operations, and perhaps
there really is such a thing as an improved design that is not just a
different set of tradeoffs.
In the full disclosure department, Tux3 is still not properly optimized
in some areas. One of them is fragmentation: it is not very hard to
make Tux3 slow down by running long tests. Our current allocation
algorithm is completely naive - it just allocates the next available
block and wraps at the top of volume. After a few wraps, it makes a big
mess. So today we are not claiming victory in the benchmark department,
we still have some work to do. Today is just about fsync, for which it
is fair to say that Tux3 sets a new performance standard.
Regards,
Daniel
Footnote: While I was doing this work, Hirofumi set about improving our
full filesystem sync to support merging parallel full syncs into single
delta commits. That one change improved fsync performance so much that
I almost abandoned my separate fsync work. However, a true, separate
fsync with aggressive group commit eventually proved superior, so now we
have both: a high performance fsync and a full filesystem sync that is
nearly as fast under many loads.
/*
* syncs.c
*
* D.R. Phillips, 2015
*
* To build: c99 -Wall syncs.c -o syncs
* To run: ./syncs [<filename> [<syncs> [<tasks>]]]
*/
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <sys/stat.h>
char text[1024] = { "hello world!\n" };
int main(int argc, const char *argv[]) {
const char *basename = argc < 1 ? "foo" : argv[1];
char name[100];
int steps = argc < 3 ? 1 : atoi(argv[2]);
int tasks = argc < 4 ? 1 : atoi(argv[3]);
int err, fd;
for (int t = 0; t < tasks; t++) {
snprintf(name, sizeof name, "%s%i", basename, t);
if (!fork())
goto child;
}
for (int t = 0; t < tasks; t++)
wait(&err);
return 0;
child:
fd = creat(name, S_IRWXU);
for (int i = 0; i < steps; i++) {
write(fd, text, sizeof text);
fsync(fd);
}
return 0;
}
More information about the Tux3
mailing list