Tux3 Report: How fast can we fsync?

Fri May 1 16:20:54 PDT 2015

On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>
> Well, yes - I never claimed XFS is a general purpose filesystem.  It
> is a high performance filesystem. It is also becoming more relevant
> to general purpose systems as low cost storage gains capabilities
> that used to be considered the domain of high performance storage...

OK. Well, Tux3 is general purpose and that means we care about single
spinning disk and small systems.

>>> So, to demonstrate, I'll run the same tests but using a 256GB
>>> samsung 840 EVO SSD and show how much the picture changes.
>>
>> I will go you one better, I ran a series of fsync tests using
>> tmpfs, and I now have a very clear picture of how the picture
>> changes. The executive summary is: Tux3 is still way faster, and
>> still scales way better to large numbers of tasks. I have every
>> confidence that the same is true of SSD.
>
> /dev/ramX can't be compared to an SSD.  Yes, they both have low
> seek/IO latency but they have very different dispatch and IO
> concurrency models.  One is synchronous, the other is fully
> asynchronous.

I had ram available and no SSD handy to abuse. I was interested in
measuring the filesystem overhead with the device factored out. I
mounted loopback on a tmpfs file, which seems to be about the same as
/dev/ram, maybe slightly faster, but much easier to configure. I ran
some tests on a ramdisk just now and was mortified to find that I have
to reboot to empty the disk. It would take a compelling reason before
I do that again.

> This is an important distinction, as we'll see later on....

I regard it as predictive of Tux3 performance on NVM.

> These trees:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
> git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3-test.git
>
> have not been updated for 11 months. I thought tux3 had died long
> ago.
>
> You should keep them up to date, and send patches for xfstests to
> support tux3, and then you'll get a lot more people running,
> testing and breaking tux3....

People are starting to show up to do testing now, pretty much for the
first time, so we must do some housecleaning. It is gratifying that Tux3
never broke for Mike, but of course at the moment it will assert just by
running out of space. As you rightly point out, that fix is urgent and is
my current project.

>> Running the same thing on tmpfs, Tux3 is significantly faster:
>>
>>     Ext4:   1.40s
>>     XFS:    1.10s
>>     Btrfs:  1.56s
>>     Tux3:   1.07s
>
> 3% is not "significantly faster". It's within run to run variation!

You are right, XFS and Tux3 are within experimental error for single
syncs on the ram disk, while Ext4 and Btrfs are way slower:

       Ext4:   1.59s
       XFS:    1.11s
       Btrfs:  1.70s
       Tux3:   1.11s

A distinct performance gap appears between Tux3 and XFS as parallel
tasks increase.

>> You wish. In fact, Tux3 is a lot faster. ...
>
> Yes, it's easy to be fast when you have simple, naive algorithms and
> an empty filesystem.

No, it isn't, or the others would be fast too. In any case our algorithms
are far from naive, except for allocation. You can rest assured that
when allocation is brought up to a respectable standard in the fullness
of time, it will be competitive and will not harm our clean filesystem
performance at all.

There is no call for you to disparage our current achievements, which
are significant. I do not mind some healthy skepticism about the
allocation work; you know as well as anyone how hard it is. However, your
denial of our current result is irritating and creates the impression
that you have an agenda. If you want to complain about something real,
complain that our current code drop is not done yet. I will humbly
apologize, and the same for enospc.

>> triple checked and reproducible:
>>
>>    Tasks:   10      100    1,000    10,000
>>    Ext4:   0.05     0.14    1.53     26.56
>>    XFS:    0.05     0.16    2.10     29.76
>>    Btrfs:  0.08     0.37    3.18     34.54
>>    Tux3:   0.02     0.05    0.18      2.16
>
> Yet I can't reproduce those XFS or ext4 numbers you are quoting
> there. eg. XFS on a 4GB ram disk:
>
> $ for i in 10 100 1000 10000; do rm /mnt/test/foo* ; time
> ./test-fsync /mnt/test/foo 10 $i; done
>
> real    0m0.030s
> user    0m0.000s
> sys     0m0.014s
>
> real    0m0.031s
> user    0m0.008s
> sys     0m0.157s
>
> real    0m0.305s
> user    0m0.029s
> sys     0m1.555s
>
> real    0m3.624s
> user    0m0.219s
> sys     0m17.631s
> $
>
> That's roughly 10x faster than your numbers. Can you describe your
> test setup in detail? e.g.  post the full log from block device
> creation to benchmark completion so I can reproduce what you are
> doing exactly?

Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.
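
A test of this general kind can be sketched in a few lines of Python.
This is not the test-fsync program used above, whose source is not shown
here. The function name run_fsync_tasks, the 4096-byte chunk, and the
reading of the arguments as (basename, fsyncs per task, number of tasks)
are all assumptions made for illustration.

```python
import os
import time

def run_fsync_tasks(basename, steps, tasks, chunk=b"x" * 4096):
    """Fork `tasks` children; each appends `chunk` and fsyncs `steps` times.

    Hypothetical stand-in for a parallel fsync test, not the real test-fsync.
    Returns (elapsed_seconds, number_of_failed_children).
    """
    start = time.monotonic()
    pids = []
    for i in range(tasks):
        pid = os.fork()
        if pid == 0:
            # Child: write to its own file, forcing each write to storage.
            status = 1
            try:
                fd = os.open(f"{basename}{i}",
                             os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
                for _ in range(steps):
                    os.write(fd, chunk)
                    os.fsync(fd)
                os.close(fd)
                status = 0
            finally:
                # Never fall back into the parent's code path.
                os._exit(status)
        pids.append(pid)

    # Parent: wait for every child and count any that did not exit cleanly.
    failures = 0
    for pid in pids:
        _, wait_status = os.waitpid(pid, 0)
        if wait_status != 0:
            failures += 1
    return time.monotonic() - start, failures
```

Something like run_fsync_tasks("/mnt/test/foo", 10, 1000) would then give
one point on the curve. Interpreter startup and fork cost are included in
the elapsed time, so absolute numbers will not match a C program.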

Clearly the shape of the curve is the same as what I measured: your
numbers increase 10x going from 100 to 1,000 tasks and 12x going from
1,000 to 10,000. The Tux3 curve is significantly flatter and starts from
a lower base, so it ends with a really wide gap. You will need to take my
word for that for now. I promise that the beer is on me should you not
find that reproducible.
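
The growth factors can be checked directly from the figures quoted in
this thread. A minimal Python sketch follows; the helper name growth() is
only illustrative. The two rows come from different machines, so only the
ratios are comparable, not the absolute times.

```python
# Elapsed seconds at 10, 100, 1,000 and 10,000 tasks, copied from this thread.
XFS_RAMDISK = [0.030, 0.031, 0.305, 3.624]  # "real" times from the ram disk run
TUX3_TMPFS = [0.02, 0.05, 0.18, 2.16]       # Tux3 row of the tmpfs table

def growth(times):
    """Return the ratio between each timing and the one before it."""
    return [later / earlier for earlier, later in zip(times, times[1:])]

for name, times in (("XFS (ram disk)", XFS_RAMDISK), ("Tux3 (tmpfs)", TUX3_TMPFS)):
    steps = ", ".join(f"{ratio:.1f}x" for ratio in growth(times))
    overall = times[3] / times[1]
    print(f"{name}: {steps}; 100 -> 10,000 tasks: {overall:.0f}x")
```

With these figures, growth from 100 to 10,000 tasks comes out near 117x
for the XFS run and near 43x for Tux3, with most of the difference in the
100 to 1,000 step.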

The repository is out of date only because I did not want to bother
Hirofumi for a merge while he finishes up his inode table
anti-fragmentation work.

>> Note: you should recheck your final number for Btrfs. I have seen
>> Btrfs fall off the rails and take wildly longer on some tests just
>> like that.
>
> Completely reproducible...

I believe you. I found that Btrfs does that way too much. So does XFS
from time to time, when it gets up into lots of tasks. Read starvation
on XFS is much worse than on Btrfs, and XFS also exhibits some very
undesirable behavior with initial file create. Note: Ext4 and Tux3 have
roughly zero read starvation in any of these tests, which pretty much
proves it is not just a block scheduler thing. I don't think this is
something you should dismiss.

>> One easily reproducible one is a denial of service
>> during the 10,000 task test where it takes multiple seconds to cat
>> small files. I saw XFS do this on both spinning disk and tmpfs, and
>> I have seen it hang for minutes trying to list a directory. I looked
>> a bit into it, and I see that you are blocking for aeons trying to
>> acquire a lock in open.
>
> Yes, that's the usual case when XFS is waiting on buffer readahead
> IO completion. The latency of which is completely determined by
> block layer queuing and scheduling behaviour. And the block device
> queue is being dominated by the 10,000 concurrent write processes
> you just ran.....
>
> "Doctor, it hurts when I do this!"

It only hurts XFS (and sometimes Btrfs) when you do that. I believe
your theory is wrong about the cause, or at least Ext4 and Tux3 skirt
that issue somehow. We definitely did not do anything special to avoid
it.

>> You and I both know the truth: Ext4 is the only really reliable
>> general purpose filesystem on Linux at the moment.
>
> That's the funniest thing I've read in a long time :)

I'm glad I could lighten your day, but I remain uncomfortable with the
read starvation issues and the massively long lock holds I see. Perhaps
XFS is stable if you don't push too many tasks at it.

[snipped the interesting ramdisk performance bug hunt]

OK, fair enough, you get a return match on SSD when I get hold of one.

>> I wouldn't be so sure about that...
>>
>>     Tasks:       8            16            32
>>     Ext4:     93.06 MB/s    98.67 MB/s   102.16 MB/s
>>     XFS:      81.10 MB/s    79.66 MB/s    73.27 MB/s
>>     Btrfs:    43.77 MB/s    64.81 MB/s    90.35 MB/s ...
>
>      Ext4:     807.21 MB/s    1089.89 MB/s   867.55 MB/s
>      XFS:      997.77 MB/s    1011.51 MB/s   876.49 MB/s
>      Btrfs:     55.66 MB/s      56.77 MB/s    60.30 MB/s
>
> Numbers are again very different for XFS and ext4 on /dev/ramX on my
> system. Need to work out why yours are so low....

Your machine makes mine look like a PCjr.

>> Ahem, are you the same person for whom fsync was the most important
>> issue in the world last time the topic came up, to the extent of
>> spreading around FUD and entirely ignoring the great work we had
>> accomplished for regular file operations? ...
>
> Actually, I don't remember any discussions about fsync.

Here:

   http://www.spinics.net/lists/linux-fsdevel/msg64825.html
   (Re: Tux3 Report: Faster than tmpfs, what?)

It still rankles that you took my innocent omission of one detail, that
Hirofumi had removed the fsyncs from dbench, and turned it into a major
FUD attack, casting aspersions on our integrity. We removed the fsyncs
because we weren't interested in measuring something we had not
implemented yet; it is that simple.

That, plus Ted's silly pronouncements that I could not answer at the
time, is what motivated me to design and implement an fsync that would
not just be competitive, but would righteously kick the tails of XFS
and Ext4, which is done. If I were you, I would wait for the code drop,
verify it, and then give credit where credit is due. Then I would
figure out how to make XFS work like that.

> Things I remember that needed addressing are:
> 	- the lack of ENOSPC detection
> 	- the writeback integration issues
> 	- the code cleanliness issues (ifdef mess, etc)
> 	- the page forking design problems
> 	- the lack of scalable inode and space allocation
> 	  algorithms.
>
> Those are the things I remember, and fsync performance pales in
> comparison to those.

With the exception of "page forking design", it is the same list as
ours, with progress on all of them. I freely admit that optimized fsync
was not on the critical path, but you made it an issue so I addressed
it. Anyway, I needed to hone my kernel debugging skills and that worked
out well.

>> I said then that when we
>> got around to a proper fsync it would be competitive. Now here it
>> is, so you want to change the topic. I understand.
>
> I haven't changed the topic, just the storage medium. The simple
> fact is that the world is moving away from slow sata storage at a
> pretty rapid pace and it's mostly going solid state. Spinning disks
> are also changing - they are going to ZBC based SMR, which is a
> completely different problem space which doesn't even appear to be
> on the tux3 radar....
>
> So where does tux3 fit into a storage future of byte addressable
> persistent memory and ZBC based SMR devices?

You won't convince us to abandon spinning rust, it's going to be around
a lot longer than you think. Obviously, we care about SSD and I believe
you will find that Tux3 is more than competitive there. We lay things
out in a very erase block friendly way. We need to address the volume
wrap issue of course, and that is in progress. This is much easier than
spinning disk.

Tux3's redirect-on-write[1] is obviously a natural for SMR; however,
I will not get excited about it unless a vendor waves money.

Regards,

Daniel

[1] Copy-on-write is a misnomer because there is no copy. The proper
term is "redirect-on-write".