Tux3 Report: How fast can we fsync?

Tue Apr 28 16:13:18 PDT 2015

Greetings,

This post is dedicated to Ted, who raised doubts a while back about 
whether Tux3 can ever have a fast fsync:

   https://lkml.org/lkml/2013/5/11/128
   "Re: Tux3 Report: Faster than tmpfs, what?"

Ted suggested that Tux3's inherently asynchronous design would be a 
limitation when it comes to synchronous operations like fsync. As he 
put it, "any advantage of decoupling the front/back end is nullified" 
because of temporal coupling. We found the opposite to be true: our 
asynchronous design works as well for synchronous operations as it does 
for any other, and Tux3 now has a high performance fsync to prove it.

Until now, Tux3 handled fsync as a full delta commit equivalent to 
syncfs, just like Ext3. This is semantically correct but sometimes 
commits more data than necessary and creates a bottleneck by serializing 
all fsyncs. To optimize it, we added a mechanism for committing any 
subset of dirty inodes separately from a full delta.

Like our normal delta commit, the design is asynchronous: front end 
tasks produce fsync work and a backend task commits it. We now have two 
backends, one to commit fsyncs and another to commit deltas, serialized 
so the combination is single threaded and mostly lockless. Each fsync 
moves an inode's dirty blocks from the front delta to the back delta, 
then queues the inode for the backend. The backend spins away committing 
fsyncs, opportunistically batching them up into group commits. The fsync 
backend gives priority to the delta backend: whenever a full delta flush 
is needed because of cache pressure or cache age, it finishes its fsync 
work and gets out of the way.

This approach required only minimal changes to our existing delta commit 
mechanism, mainly to support crash consistency. In particular, our delta 
model did not need to be changed at all, and block forking protects 
fsync commits in the same way as normal delta commits. That is, an inode 
can be freely updated (even via mmap) while it is being fsynced. The 
fsync data is isolated from those changes and the frontend does not 
stall.

I measured fsync performance using a 7200 RPM disk as a virtual drive 
under KVM, configured with cache=none so that asynchronous writes are 
cached and synchronous writes translate into direct writes to the 
block device. To focus purely on fsync, I wrote a small utility (at the 
end of this post) that forks a number of tasks, each of which 
continuously appends to and fsyncs its own file. For a single task doing 
1,000 fsyncs of 1K each, we have:

    Ext4:  34.34s
    XFS:   23.63s
    Btrfs: 34.84s
    Tux3:  17.24s

Tux3 has a nice advantage for isolated fsyncs. This is mainly due to 
writing a small number of blocks per commit, currently just five blocks 
for a small file. As for a normal delta commit, we do not update bitmap 
blocks or index nodes, but log logical changes instead, flushing out 
the primary metadata with occasional "unify" deltas to control the log 
size and keep replay fast. The 1,000 fsync test typically does about 
three unifies, so we do in fact update all our metadata and pay that 
cost, just not on every commit.

The win is bigger than it appears at first glance, because writing the 
commit block takes about 16.5 ms. That would be similar for all the 
filesystems tested, so factoring that out, we see that Tux3 must be 
doing 10 times less work than XFS, 24 times less than Ext4 and 25 times 
less than Btrfs. Over time, that gap should widen as we reduce our small 
file commit overhead towards just two blocks.

In this single threaded test, we pay the full price for "communication 
delays" with our asynchronous backend, which evidently does not amount 
to much. Given that a context switch is on the order of microseconds 
while writing the commit block is on the order of milliseconds, it is 
unsurprising that two extra context switches just disappear in the 
noise.

Things get more interesting with parallel fsyncs. In this test, each 
task does ten fsyncs and task count scales from ten to ten thousand. We 
see that all tested filesystems are able to combine fsyncs into group 
commits, with varying degrees of success:

    Tasks:   10      100    1,000    10,000
    Ext4:   0.79s   0.98s    4.62s    61.45s
    XFS:    0.75s   1.68s   20.97s   238.23s
    Btrfs   0.53s   0.78s    3.80s    84.34s
    Tux3:   0.27s   0.34s    1.00s     6.86s
              (lower is better)

We expect sub-linear scaling with respect to tasks as opportunities to 
combine commits increase, then linear scaling as total write traffic 
begins to dominate. Tux3 still shows sub-linear scaling even at 10,000 
tasks. XFS scales poorly, and also suffers from read starvation at the 
high end, sometimes taking tens of seconds to cat a file or minutes to 
list a directory. Ext4 and Tux3 exhibit no such issues, remaining 
completely responsive under all tested loads. The bottom line for this 
test is, Tux3 is twice as fast at fsync with a modest task count, and 
the gap widens to nine times faster than its nearest competitor as task 
count increases.

Is there any practical use for fast parallel fsync of tens of thousands
of tasks? This could be useful for a scalable transaction server that 
sits directly on the filesystem instead of a database, as is the 
fashion for big data these days. It certainly can't hurt to know that 
if you need that kind of scaling, Tux3 will do it.

Of course, a pure fsync load could be viewed as somewhat unnatural. We
also need to know what happens under a realistic load with buffered
operations mixed with fsyncs. We turn to an old friend, dbench:

Dbench -t10

    Tasks:       8           16           32
    Ext4:    35.32 MB/s   34.08 MB/s   39.71 MB/s
    XFS:     32.12 MB/s   25.08 MB/s   30.12 MB/s
    Btrfs:   54.40 MB/s   75.09 MB/s  102.81 MB/s
    Tux3:    85.82 MB/s  133.69 MB/s  159.78 MB/s
                   (higher is better)

Tux3 and Btrfs scale well and are way ahead of Ext4 and XFS, while Ext4 
and XFS scale poorly or even negatively. Tux3 is the leader by a wide
margin, beating XFS by more than a factor of 5 at the high end.

Dbench -t10 -s (all file operations synchronous)

    Tasks:       8           16           32
    Ext4:     4.51 MB/s    6.25 MB/s    7.72 MB/s
    XFS:      4.24 MB/s    4.77 MB/s    5.15 MB/s
    Btrfs:    7.98 MB/s   13.87 MB/s   22.87 MB/s
    Tux3:    15.41 MB/s   25.56 MB/s   39.15 MB/s
                   (higher is better)

With a pure synchronous load (O_SYNC) the ranking is not changed but the 
gaps widen, and Tux3 outperforms XFS by a factor of 7.5.

At the risk of overgeneralizing, a trend seems to be emerging: the new, 
write-anywhere designs run synchronous operations faster, combine 
synchronous and asynchronous operations more efficiently, and scale 
better to high task counts than the traditional journaling designs. If 
there is to be an Ext5, it might be worth considering the merit of 
abandoning the journal in favor of something along the lines of Tux3's 
redirect-on-write and logging combination.

Getting back to Ted's question, perhaps an asynchronous design really is 
a better idea all round, even for synchronous operations, and perhaps 
there really is such a thing as an improved design that is not just a 
different set of tradeoffs.

In the full disclosure department, Tux3 is still not properly optimized 
in some areas. One of them is fragmentation: it is not very hard to 
make Tux3 slow down by running long tests. Our current allocation 
algorithm is completely naive - it just allocates the next available 
block and wraps at the top of volume. After a few wraps, it makes a big 
mess. So today we are not claiming victory in the benchmark department, 
we still have some work to do. Today is just about fsync, for which it 
is fair to say that Tux3 sets a new performance standard.

Regards,

Daniel

Footnote: While I was doing this work, Hirofumi set about improving our 
full filesystem sync to support merging parallel full syncs into single 
delta commits. That one change improved fsync performance so much that 
I almost abandoned my separate fsync work. However, a true, separate 
fsync with aggressive group commit eventually proved superior, so now we 
have both: a high performance fsync and a full filesystem sync that is 
nearly as fast under many loads.

/*
 * syncs.c
 * 
 * D.R. Phillips, 2015
 * 
 * To build: c99 -Wall syncs.c -o syncs
 * To run: ./syncs [<filename> [<syncs> [<tasks>]]]
 */

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <sys/stat.h>

char text[1024] = { "hello world!\n" };

int main(int argc, const char *argv[]) {
	const char *basename = argc < 1 ? "foo" : argv[1];
	char name[100];
	int steps = argc < 3 ? 1 : atoi(argv[2]);
	int tasks = argc < 4 ? 1 : atoi(argv[3]);
	int err, fd;

	for (int t = 0; t < tasks; t++) {
		snprintf(name, sizeof name, "%s%i", basename, t);
		if (!fork())
			goto child;
	}
	for (int t = 0; t < tasks; t++)
		wait(&err);
	return 0;

child:
	fd = creat(name, S_IRWXU);
	for (int i = 0; i < steps; i++) {
		write(fd, text, sizeof text);
		fsync(fd);
	}
	return 0;
}