Tux3 Report: How fast can we fsync?

Dave Chinner david at fromorbit.com
Wed Apr 29 18:46:16 PDT 2015


On Tue, Apr 28, 2015 at 04:13:18PM -0700, Daniel Phillips wrote:
> Greetings,
> 
> This post is dedicated to Ted, who raised doubts a while back about
> whether Tux3 can ever have a fast fsync:
> 
>   https://lkml.org/lkml/2013/5/11/128
>   "Re: Tux3 Report: Faster than tmpfs, what?"

[snip]

> I measured fsync performance using a 7200 RPM disk as a virtual
> drive under KVM, configured with cache=none so that asynchronous
> writes are cached and synchronous writes translate into direct
> writes to the block device.

Yup, a slow single spindle, so fsync performance is determined by
seek latency of the filesystem. Hence the filesystem that "wins"
will be the filesystem that minimises fsync seek latency above all
other considerations.

http://www.spinics.net/lists/kernel/msg1978216.html
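
(A note for anyone reproducing the setup: cache=none there is the
qemu block layer cache mode, i.e. a guest drive configured with
something like "-drive file=/dev/sdX,cache=none,if=virtio", which
opens the backing device with O_DIRECT on the host so that guest
cache flushes go straight to the hardware. The exact invocation is
an assumption; it isn't given in the original post.)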

So, to demonstrate, I'll run the same tests but using a 256GB
Samsung 840 EVO SSD and show how much the picture changes. I didn't
test tux3, as you don't make it easy to get or build.

> To focus purely on fsync, I wrote a
> small utility (at the end of this post) that forks a number of
> tasks, each of which continuously appends to and fsyncs its own
> file. For a single task doing 1,000 fsyncs of 1K each, we have:
> 
>    Ext4:  34.34s
>    XFS:   23.63s
>    Btrfs: 34.84s
>    Tux3:  17.24s

   Ext4:   1.94s
   XFS:    2.06s
   Btrfs:  2.06s

All equally fast, so I can't see how tux3 would be much faster here.
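
For reference, a minimal sketch of the kind of fork-and-fsync
benchmark being described (a reconstruction for illustration, not
Daniel's actual utility, which was attached to the original post;
task count and fsync count are taken as arguments):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int tasks = argc > 1 ? atoi(argv[1]) : 1;
        int syncs = argc > 2 ? atoi(argv[2]) : 1000;
        char buf[1024];
        int i, j;

        memset(buf, 'x', sizeof(buf));

        for (i = 0; i < tasks; i++) {
                if (fork() == 0) {
                        char name[64];
                        int fd;

                        /* each child appends to and fsyncs its own file */
                        snprintf(name, sizeof(name), "fsync-%d", i);
                        fd = open(name, O_CREAT | O_WRONLY | O_APPEND, 0644);
                        if (fd < 0) {
                                perror("open");
                                _exit(1);
                        }
                        for (j = 0; j < syncs; j++) {
                                if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                                        _exit(1);
                                fsync(fd);
                        }
                        _exit(0);
                }
        }
        while (wait(NULL) > 0)          /* reap all the children */
                ;
        return 0;
}

Running the compiled sketch as, say, "./a.out 10000 10" under
time(1) reproduces the shape of the parallel test quoted below.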

> Things get more interesting with parallel fsyncs. In this test, each
> task does ten fsyncs and task count scales from ten to ten thousand.
> We see that all tested filesystems are able to combine fsyncs into
> group commits, with varying degrees of success:
> 
>    Tasks:   10      100    1,000    10,000
>    Ext4:   0.79s   0.98s    4.62s    61.45s
>    XFS:    0.75s   1.68s   20.97s   238.23s
>    Btrfs:  0.53s   0.78s    3.80s    84.34s
>    Tux3:   0.27s   0.34s    1.00s     6.86s

   Tasks:   10      100    1,000    10,000
   Ext4:   0.05s   0.12s    0.48s     3.99s
   XFS:    0.25s   0.41s    0.96s     4.07s
   Btrfs:  0.22s   0.50s    2.86s   161.04s
             (lower is better)

Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
very much faster as most of the elapsed time in the test is from
forking the processes that do the IO and fsyncs.

FWIW, btrfs shows its horrible fsync implementation here, burning
huge amounts of CPU to do bugger all IO. i.e. it burnt all 16 CPUs
for two and a half minutes in that 10,000 task test, so it wasn't
IO bound at all.

> Is there any practical use for fast parallel fsync of tens of thousands
> of tasks? This could be useful for a scalable transaction server
> that sits directly on the filesystem instead of a database, as is
> the fashion for big data these days. It certainly can't hurt to know
> that if you need that kind of scaling, Tux3 will do it.

Ext4 and XFS already do that just fine, too, when you use storage
suited to such a workload and you have a sane interface for
submitting tens of thousands of concurrent fsync operations, e.g.:

http://oss.sgi.com/archives/xfs/2014-06/msg00214.html
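
For illustration, here is roughly what submitting a large batch of
fsyncs through the Linux native AIO interface looks like from
userspace with libaio. A sketch only: it assumes an AIO fsync
implementation like the one linked above; mainline kernels at this
point fail IOCB_CMD_FSYNC with -EINVAL on most filesystems.

/* Sketch: queue a batch of fsyncs in one syscall via Linux native
 * AIO (link with -laio). Assumes the filesystem implements AIO
 * fsync (IOCB_CMD_FSYNC); otherwise io_submit() returns -EINVAL. */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>

#define NR 1000                 /* fsyncs in flight; mind ulimit -n */

int main(void)
{
        io_context_t ctx = 0;
        static struct iocb cbs[NR];
        static struct iocb *cbp[NR];
        static struct io_event events[NR];
        char name[64];
        int i, fd, ret;

        ret = io_setup(NR, &ctx);
        if (ret < 0) {
                fprintf(stderr, "io_setup: %d\n", ret);
                return 1;
        }
        for (i = 0; i < NR; i++) {
                snprintf(name, sizeof(name), "aio-fsync-%d", i);
                fd = open(name, O_CREAT | O_WRONLY, 0644);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                io_prep_fsync(&cbs[i], fd);     /* IOCB_CMD_FSYNC */
                cbp[i] = &cbs[i];
        }
        /* one syscall queues all of them; the fs can group commit */
        ret = io_submit(ctx, NR, cbp);
        if (ret != NR) {
                fprintf(stderr, "io_submit: %d\n", ret);
                return 1;
        }
        /* block until every fsync has completed */
        ret = io_getevents(ctx, NR, NR, events, NULL);
        fprintf(stderr, "completed %d fsyncs\n", ret);
        io_destroy(ctx);
        return 0;
}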

> Of course, a pure fsync load could be viewed as somewhat unnatural. We
> also need to know what happens under a realistic load with buffered
> operations mixed with fsyncs. We turn to an old friend, dbench:
> 
> Dbench -t10
> 
>    Tasks:       8           16           32
>    Ext4:    35.32 MB/s   34.08 MB/s   39.71 MB/s
>    XFS:     32.12 MB/s   25.08 MB/s   30.12 MB/s
>    Btrfs:   54.40 MB/s   75.09 MB/s  102.81 MB/s
>    Tux3:    85.82 MB/s  133.69 MB/s  159.78 MB/s
>                   (higher is better)

On an SSD (256GB Samsung 840 EVO), running 4.0.0:

   Tasks:       8           16           32
   Ext4:    598.27 MB/s   981.13 MB/s  1233.77 MB/s
   XFS:     884.62 MB/s  1328.21 MB/s  1373.66 MB/s
   Btrfs:   201.64 MB/s   137.55 MB/s   108.56 MB/s

dbench looks *very different* when there is no seek latency,
doesn't it?
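
(For reference: dbench takes the client count as its final argument,
so these runs correspond to invocations along the lines of
"dbench -t 10 32", with -s added for the all-synchronous runs below.
The exact command lines are inferred from the headings, not given in
either post.)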

> Dbench -t10 -s (all file operations synchronous)
> 
>    Tasks:       8           16           32
>    Ext4:     4.51 MB/s    6.25 MB/s    7.72 MB/s
>    XFS:      4.24 MB/s    4.77 MB/s    5.15 MB/s
>    Btrfs:    7.98 MB/s   13.87 MB/s   22.87 MB/s
>    Tux3:    15.41 MB/s   25.56 MB/s   39.15 MB/s
>                   (higher is better)

    Tasks:      8            16           32
    Ext4:   173.54 MB/s  294.41 MB/s  424.11 MB/s
    XFS:    172.98 MB/s  342.78 MB/s  458.87 MB/s
    Btrfs:   36.92 MB/s   34.52 MB/s   55.19 MB/s

Again, the numbers are completely the other way around on an SSD,
with the conventional filesystems being 5-10x faster than the
write-anywhere/COW style filesystem.

....

> In the full disclosure department, Tux3 is still not properly
> optimized in some areas. One of them is fragmentation: it is not
> very hard to make Tux3 slow down by running long tests. Our current
> [...]

Oh, that still hasn't been fixed?

Until you sort out how you are going to scale allocation to tens of
TB and not fragment free space over time, fsync performance of the
filesystem is pretty much irrelevant. Changing the allocation
algorithms will fundamentally alter the IO patterns and so all these
benchmarks are essentially meaningless.

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com


