Tux3 Report: How fast can we fsync?

Dave Chinner david at fromorbit.com
Fri May 1 08:38:55 PDT 2015


On Thu, Apr 30, 2015 at 03:28:13AM -0700, Daniel Phillips wrote:
> On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:
> >>I measured fsync performance using a 7200 RPM disk as a virtual
> >>drive under KVM, configured with cache=none so that asynchronous
> >>writes are cached and synchronous writes translate into direct
> >>writes to the block device.
> >
> >Yup, a slow single spindle, so fsync performance is determined by
> >seek latency of the filesystem. Hence the filesystem that "wins"
> >will be the filesystem that minimises fsync seek latency above
> >all other considerations.
> >
> >http://www.spinics.net/lists/kernel/msg1978216.html
> 
> If you want to declare that XFS only works well on solid state
> disks and big storage arrays, that is your business. But if you
> do, you can no longer call XFS a general purpose filesystem. And

Well, yes - I never claimed XFS is a general purpose filesystem.  It
is a high performance filesystem. It is also becoming more relevant
to general purpose systems as low cost storage gains capabilities
that used to be considered the domain of high performance storage...

> >So, to demonstrate, I'll run the same tests but using a 256GB
> >samsung 840 EVO SSD and show how much the picture changes.
> 
> I will go you one better, I ran a series of fsync tests using
> tmpfs, and I now have a very clear picture of how the picture
> changes. The executive summary is: Tux3 is still way faster, and
> still scales way better to large numbers of tasks. I have every
> confidence that the same is true of SSD.

/dev/ramX can't be compared to an SSD.  Yes, they both have low
seek/IO latency but they have very different dispatch and IO
concurrency models.  One is synchronous, the other is fully
asynchronous.

This is an important distinction, as we'll see later on....

> >I didn't test tux3, you don't make it easy to get or build.
> 
> There is no need to apologize for not testing Tux3, however, it is
> unseemly to throw mud at the same time. Remember, you are the

These trees:

git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git

have not been updated for 11 months. I thought tux3 had died long
ago.

You should keep them up to date, and send patches for xfstests to
support tux3, and then you'll get a lot more people running,
testing and breaking tux3....

> >>To focus purely on fsync, I wrote a
> >>small utility (at the end of this post) that forks a number of
> >>tasks, each of which continuously appends to and fsyncs its own
> >>file. For a single task doing 1,000 fsyncs of 1K each, we have:
.....
> >All equally fast, so I can't see how tux3 would be much faster here.
> 
> Running the same thing on tmpfs, Tux3 is significantly faster:
> 
>     Ext4:   1.40s
>     XFS:    1.10s
>     Btrfs:  1.56s
>     Tux3:   1.07s

3% is not "significantly faster". It's within run-to-run variation!

> >   Tasks:   10      100    1,000    10,000
> >   Ext4:   0.05s   0.12s    0.48s     3.99s
> >   XFS:    0.25s   0.41s    0.96s     4.07s
> >   Btrfs   0.22s   0.50s    2.86s   161.04s
> >             (lower is better)
> >
> >Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
> >very much faster as most of the elapsed time in the test is from
> >forking the processes that do the IO and fsyncs.
> 
> You wish. In fact, Tux3 is a lot faster.

Yes, it's easy to be fast when you have simple, naive algorithms and
an empty filesystem.

> triple checked and reproducible:
> 
>    Tasks:   10      100    1,000    10,000
>    Ext4:   0.05     0.14    1.53     26.56
>    XFS:    0.05     0.16    2.10     29.76
>    Btrfs:  0.08     0.37    3.18     34.54
>    Tux3:   0.02     0.05    0.18      2.16

Yet I can't reproduce those XFS or ext4 numbers you are quoting
there. eg. XFS on a 4GB ram disk:

$ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time ./test-fsync /mnt/test/foo 10 $i; done

real    0m0.030s
user    0m0.000s
sys     0m0.014s

real    0m0.031s
user    0m0.008s
sys     0m0.157s

real    0m0.305s
user    0m0.029s
sys     0m1.555s

real    0m3.624s
user    0m0.219s
sys     0m17.631s
$

That's roughly 10x faster than your numbers. Can you describe your
test setup in detail? e.g.  post the full log from block device
creation to benchmark completion so I can reproduce what you are
doing exactly?

> Note: you should recheck your final number for Btrfs. I have seen
> Btrfs fall off the rails and take wildly longer on some tests just
> like that.

Completely reproducible:

$ sudo mkfs.btrfs -f /dev/vdc
Btrfs v3.16.2
See http://btrfs.wiki.kernel.org for more information.

Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vdc
        nodesize 16384 leafsize 16384 sectorsize 4096 size 500.00TiB
$ sudo mount /dev/vdc /mnt/test
$ sudo chmod 777 /mnt/test
$ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time ./test-fsync /mnt/test/foo 10 $i; done

real    0m0.068s
user    0m0.000s
sys     0m0.061s

real    0m0.563s
user    0m0.001s
sys     0m2.047s

real    0m2.851s
user    0m0.040s
sys     0m24.503s

real    2m38.713s
user    0m0.533s
sys     38m34.831s

Same result - ~160s burning all 16 CPUs, as can be seen by the
system time.

And even on a 4GB ram disk, the 10000 process test comes in at:

real    0m35.567s
user    0m0.707s
sys     6m1.922s

That's the same wall time as your test, but the CPU burn on my
machine is still clearly evident. You indicated that it's not doing
this on your machine, so I don't think we can really use btrfs
numbers for comparison purposes if it is behaving so differently on
different machines....

[snip]

> One easily reproducible one is a denial of service
> during the 10,000 task test where it takes multiple seconds to cat
> small files. I saw XFS do this on both spinning disk and tmpfs, and
> I have seen it hang for minutes trying to list a directory. I looked
> a bit into it, and I see that you are blocking for aeons trying to
> acquire a lock in open.

Yes, that's the usual case when XFS is waiting on buffer readahead
IO completion. The latency of which is completely determined by
block layer queuing and scheduling behaviour. And the block device
queue is being dominated by the 10,000 concurrent write processes
you just ran.....

"Doctor, it hurts when I do this!"

[snip]

> You and I both know the truth: Ext4 is the only really reliable
> general purpose filesystem on Linux at the moment.

BWAHAHAHAHAHAHAH-*choke*

*cough*

*cough*

/me wipes tears from his eyes

That's the funniest thing I've read in a long time :)

[snip]

> >On a SSD (256GB samsung 840 EVO), running 4.0.0:
> >
> >   Tasks:       8           16           32
> >   Ext4:    598.27 MB/s    981.13 MB/s 1233.77 MB/s
> >   XFS:     884.62 MB/s   1328.21 MB/s 1373.66 MB/s
> >   Btrfs:   201.64 MB/s    137.55 MB/s  108.56 MB/s
> >
> >dbench looks *very different* when there is no seek latency,
> >doesn't it?
> 
> It looks like Btrfs hit a bug, not a huge surprise. Btrfs hit an assert
> for me earlier this evening. It is rare but it happens. I rebooted
> and got sane numbers. Running dbench -t10 on tmpfs I get:
> 
>     Tasks:       8            16            32
>     Ext4:    660.69 MB/s   708.81 MB/s   720.12 MB/s
>     XFS:     692.01 MB/s   388.53 MB/s   134.84 MB/s
>     Btrfs:   229.66 MB/s   341.27 MB/s   377.97 MB/s
>     Tux3:   1147.12 MB/s  1401.61 MB/s  1283.74 MB/s
> 
> Looks like XFS hit a bump and fell off the cliff at 32 threads. I reran
> that one many times because I don't want to give you an inaccurate
> report.

I can't reproduce those numbers, either. On /dev/ram0:

    Tasks:       8            16            32
    Ext4:    1416.11 MB/s   1585.81 MB/s   1406.18 MB/s
    XFS:     2580.58 MB/s   1367.48 MB/s    994.46 MB/s
    Btrfs:    151.89 MB/s     84.88 MB/s     73.16 MB/s

Still, that negative XFS scalability shouldn't be occurring - it
should level off and be much flatter if everything is working
correctly.

<ding>

Ah.

Ram disks and synchronous IO.....

The XFS journal is a completely asynchronous IO engine, and the
synchronous IO done by the ram disk really screws with the
concurrency model. There are journal write aggregation optimisations
that are based on the "buffer under IO" state detection, which is
completely skipped when journal IO is synchronous and completed in
the submission context. This problem doesn't occur on actual storage
devices where IO is asynchronous.

So, yes, dbench can trigger an interesting behaviour in XFS, but
it's well understood and doesn't actually affect normal storage
devices. If you need a volatile filesystem for performance
reasons then tmpfs is what you want, not XFS....

[
	Feel free to skip the detail:

	Let's go back to that SSD, which does asynchronous IO, so
	the journal operates fully asynchronously:

	$ for i in 8 16 32 64 128 256; do dbench -t10  $i -D /mnt/test; done
	Throughput 811.806 MB/sec  8 clients  8 procs  max_latency=12.152 ms
	Throughput 1285.47 MB/sec  16 clients  16 procs  max_latency=22.880 ms
	Throughput 1516.22 MB/sec  32 clients  32 procs  max_latency=73.381 ms
	Throughput 1724.57 MB/sec  64 clients  64 procs  max_latency=256.681 ms
	Throughput 2046.91 MB/sec  128 clients  128 procs max_latency=1068.169 ms
	Throughput 1895.4 MB/sec  256 clients  256 procs max_latency=3157.738 ms

	So performance improves out to 128 processes and then the
	SSD runs out of capacity - it's doing >400MB/s write IO at
	128 clients. That makes latency blow out as we add more
	load, so it doesn't go any faster and we start to back up on
	the log. Hence we slowly start to go backwards as client
	count continues to increase and contention builds up on
	global wait queues.

	Now, XFS has 8 log buffers and so can issue 8 concurrent
	journal writes. Let's run dbench with fewer processes on a
	ram disk, and see what happens as we increase the number of
	processes doing IO and hence triggering journal writes:

	$ for i in 1 2 4 6 8; do dbench -t10  $i -D /mnt/test |grep Throughput; done
	Throughput 653.163 MB/sec  1 clients  1 procs  max_latency=0.355 ms
	Throughput 1273.65 MB/sec  2 clients  2 procs  max_latency=3.947 ms
	Throughput 2189.19 MB/sec  4 clients  4 procs  max_latency=7.582 ms
	Throughput 2318.33 MB/sec  6 clients  6 procs  max_latency=8.023 ms
	Throughput 2212.85 MB/sec  8 clients  8 procs  max_latency=9.120 ms

	Yeah, ok, we scale out to 4 processes, then level off.
	That's going to be limited by allocation concurrency during
	writes, not the journal (the default is 4 AGs on a
	filesystem so small). Let's make 16 AGs, cause seeks don't
	matter on a ram disk.

	$ sudo mkfs.xfs -f -d agcount=16 /dev/ram0
	....
	$ for i in 1 2 4 6 8; do dbench -t10  $i -D /mnt/test |grep Throughput; done
	Throughput 656.189 MB/sec  1 clients  1 procs  max_latency=0.565 ms
	Throughput 1277.25 MB/sec  2 clients  2 procs  max_latency=3.739 ms
	Throughput 2350.73 MB/sec  4 clients  4 procs  max_latency=5.126 ms
	Throughput 2754.3 MB/sec  6 clients  6 procs  max_latency=8.063 ms
	Throughput 3135.11 MB/sec  8 clients  8 procs  max_latency=6.746 ms

	Yup, as expected, we continue to increase performance out
	to 8 processes now that there isn't an allocation
	concurrency limit being hit.

	What happens as we pass 8 processes now?

	$ for i in 4 8 12 16; do dbench -t10  $i -D /mnt/test |grep Throughput; done
	Throughput 2277.53 MB/sec  4 clients  4 procs  max_latency=5.778 ms
	Throughput 3070.3 MB/sec  8 clients  8 procs  max_latency=7.808 ms
	Throughput 2555.29 MB/sec  12 clients  12 procs  max_latency=8.518 ms
	Throughput 1868.96 MB/sec  16 clients  16 procs  max_latency=14.193 ms
	$

	As expected, past 8 processes performance tails off, because
	the journal state machine does not schedule after dispatching
	the journal IO. With synchronous completion there is no
	"under IO" stage in the state machine, so other threads never
	get the window that would let them aggregate their journal
	writes into the next active log buffer.

	I'd completely forgotten about this - I discovered it 3 or 4
	years ago, and then simply stopped using ramdisks for
	performance testing because I could get better performance
	from XFS on highly concurrent workloads from real storage.
]

> >>Dbench -t10 -s (all file operations synchronous)
> >>
> >>   Tasks:       8           16           32
> >>   Ext4:     4.51 MB/s    6.25 MB/s    7.72 MB/s
> >>   XFS:      4.24 MB/s    4.77 MB/s    5.15 MB/s
> >>   Btrfs:    7.98 MB/s   13.87 MB/s   22.87 MB/s
> >>   Tux3:    15.41 MB/s   25.56 MB/s   39.15 MB/s
> >>                  (higher is better)
> >
> >    Ext4:   173.54 MB/s  294.41 MB/s  424.11 MB/s
> >    XFS:    172.98 MB/s  342.78 MB/s  458.87 MB/s
> >    Btrfs:   36.92 MB/s   34.52 MB/s   55.19 MB/s
> >
> >Again, the numbers are completely the other way around on a SSD,
> >with the conventional filesystems being 5-10x faster than the
> >WA/COW style filesystem.
> 
> I wouldn't be so sure about that...
> 
>     Tasks:       8            16            32
>     Ext4:     93.06 MB/s    98.67 MB/s   102.16 MB/s
>     XFS:      81.10 MB/s    79.66 MB/s    73.27 MB/s
>     Btrfs:    43.77 MB/s    64.81 MB/s    90.35 MB/s
>     Tux3:    198.49 MB/s   279.00 MB/s   318.41 MB/s

     Ext4:     807.21 MB/s    1089.89 MB/s   867.55 MB/s
     XFS:      997.77 MB/s    1011.51 MB/s   876.49 MB/s
     Btrfs:     55.66 MB/s      56.77 MB/s    60.30 MB/s

Numbers are again very different for XFS and ext4 on /dev/ramX on my
system. Need to work out why yours are so low....

> >Until you sort out how you are going to scale allocation to tens of
> >TB and not fragment free space over time, fsync performance of the
> >filesystem is pretty much irrelevant. Changing the allocation
> >algorithms will fundamentally alter the IO patterns and so all these
> >benchmarks are essentially meaningless.
> 
> Ahem, are you the same person for whom fsync was the most important
> issue in the world last time the topic came up, to the extent of
> spreading around FUD and entirely ignoring the great work we had
> accomplished for regular file operations?

Actually, I don't remember any discussions about fsync.

Things I remember that needed addressing are:
	- the lack of ENOSPC detection
	- the writeback integration issues
	- the code cleanliness issues (ifdef mess, etc)
	- the page forking design problems
	- the lack of scalable inode and space allocation
	  algorithms.

Those are the things I remember, and fsync performance pales in
comparison to those.

> I said then that when we
> got around to a proper fsync it would be competitive. Now here it
> is, so you want to change the topic. I understand.

I haven't changed the topic, just the storage medium. The simple
fact is that the world is moving away from slow sata storage at a
pretty rapid pace and it's mostly going solid state. Spinning disks
are also changing - they are going to ZBC based SMR, which is a
completely different problem space that doesn't even appear to be
on the tux3 radar....

So where does tux3 fit into a storage future of byte addressable
persistent memory and ZBC based SMR devices?

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com
