Tux3 Report: How fast can we fsync?

Fri May 1 18:07:48 PDT 2015

On Fri, 1 May 2015, Daniel Phillips wrote:

> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>
>> Well, yes - I never claimed XFS is a general purpose filesystem.  It
>> is a high performance filesystem. It is also becoming more relevant
>> to general purpose systems as low cost storage gains capabilities
>> that used to be considered the domain of high performance storage...
>
> OK. Well, Tux3 is general purpose and that means we care about single
> spinning disk and small systems.

Keep in mind that if you optimize only for the small systems you may not scale
as well to the larger ones.

>>>> So, to demonstrate, I'll run the same tests but using a 256GB
>>>> samsung 840 EVO SSD and show how much the picture changes.
>>>
>>> I will go you one better, I ran a series of fsync tests using
>>> tmpfs, and I now have a very clear picture of how the picture
>>> changes. The executive summary is: Tux3 is still way faster, and
>>> still scales way better to large numbers of tasks. I have every
>>> confidence that the same is true of SSD.
>>
>> /dev/ramX can't be compared to an SSD.  Yes, they both have low
>> seek/IO latency but they have very different dispatch and IO
>> concurrency models.  One is synchronous, the other is fully
>> asynchronous.
>
> I had ram available and no SSD handy to abuse. I was interested in
> measuring the filesystem overhead with the device factored out. I
> mounted loopback on a tmpfs file, which seems to be about the same as
> /dev/ram, maybe slightly faster, but much easier to configure. I ran
> some tests on a ramdisk just now and was mortified to find that I have
> to reboot to empty the disk. It would take a compelling reason before
> I do that again.
>
>> This is an important distinction, as we'll see later on....
>
> I regard it as predictive of Tux3 performance on NVM.

Per the ramdisk, but possibly not as relevant as you may think. This is why it's
good to test on as many different systems as you can. As you run into different
types of performance you can then pick ones to keep and test all the time.

Single spinning disk is interesting now, but will be less interesting later.
Multiple spinning disks in an array of some sort are going to remain very
interesting for quite a while.

Now, some things take a lot more work to test than others. Getting time on a
system with a high performance, high capacity RAID is hard, but getting hold of
an SSD from Fry's is much easier. If it's a budget item, ping me directly and I
can donate one for testing (the cost of a drive is within my unallocated budget
and using that to improve Linux is worthwhile).

>>> Running the same thing on tmpfs, Tux3 is significantly faster:
>>>
>>>     Ext4:   1.40s
>>>     XFS:    1.10s
>>>     Btrfs:  1.56s
>>>     Tux3:   1.07s
>>
>> 3% is not "significantly faster". It's within run to run variation!
>
> You are right, XFS and Tux3 are within experimental error for single
> syncs on the ram disk, while Ext4 and Btrfs are way slower:
>
>       Ext4:   1.59s
>       XFS:    1.11s
>       Btrfs:  1.70s
>       Tux3:   1.11s
>
> A distinct performance gap appears between Tux3 and XFS as parallel
> tasks increase.

It will be interesting to see if this continues to be true on more systems. I 
hope it does.
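
A minimal sketch of the arithmetic behind the "3%" remark, using only the
timings in the two quoted tables. The names `tmpfs_times`, `ramdisk_times` and
`gap_to_fastest` are made up for illustration and are not part of anyone's
benchmark tooling.

```python
# Timings in seconds, copied from the two quoted tables above.
tmpfs_times = {"Ext4": 1.40, "XFS": 1.10, "Btrfs": 1.56, "Tux3": 1.07}
ramdisk_times = {"Ext4": 1.59, "XFS": 1.11, "Btrfs": 1.70, "Tux3": 1.11}


def gap_to_fastest(timings):
    """Return each entry's slowdown relative to the fastest one, in percent."""
    best = min(timings.values())
    return {name: 100.0 * (t - best) / best for name, t in timings.items()}


for label, table in (("tmpfs", tmpfs_times), ("ramdisk", ramdisk_times)):
    gaps = gap_to_fastest(table)
    print(label, {name: round(gap, 1) for name, gap in gaps.items()})
```

Whether a gap of a few percent means anything depends on the spread across
repeated runs, which neither table reports, so a comparison like this is only a
starting point.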

>>> You wish. In fact, Tux3 is a lot faster. ...
>>
>> Yes, it's easy to be fast when you have simple, naive algorithms and
>> an empty filesystem.
>
> No it isn't or the others would be fast too. In any case our algorithms
> are far from naive, except for allocation. You can rest assured that
> when allocation is brought up to a respectable standard in the fullness
> of time, it will be competitive and will not harm our clean filesystem
> performance at all.
>
> There is no call for you to disparage our current achievements, which
> are significant. I do not mind some healthy skepticism about the
> allocation work, you know as well as anyone how hard it is. However your
> denial of our current result is irritating and creates the impression
> that you have an agenda. If you want to complain about something real,
> complain that our current code drop is not done yet. I will humbly
> apologize, and the same for enospc.

As I'm reading Dave's comments, he isn't attacking you the way you seem to think
he is. He is pointing out that there are problems with your data, but he's also
taking a lot of time to explain what's happening (and yes, some of this is
probably because your simple tests with XFS made it look so bad).

The other filesystems don't use naive algorithms, they use something more
complex, and while your current numbers are interesting, they are only
preliminary until you add something to handle fragmentation. That can cause very
significant problems. Remember how fabulous btrfs looked in the initial reports?
Then corner cases were found that caused real problems, and as the algorithms
have been changed to prevent those corner cases from being so easy to hit, the
common case has suffered somewhat. This isn't an attack on Tux3 or btrfs, it's
just a reality of programming. If you are not accounting for all the corner
cases, everything is easier, and faster.

>> That's roughly 10x faster than your numbers. Can you describe your
>> test setup in detail? e.g.  post the full log from block device
>> creation to benchmark completion so I can reproduce what you are
>> doing exactly?
>
> Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
> more substantial, so I can't compare my numbers directly to yours.

If you are doing tests with a 4G ramdisk on a machine with only 4G of RAM, it
seems like you end up testing a lot more than just the filesystem. Testing in
such low memory situations can identify significant issues, but it is
questionable as a 'which filesystem is better' benchmark.

> Clearly the curve is the same: your numbers increase 10x going from 100
> to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve is
> significantly flatter and starts from a lower base, so it ends with a
> really wide gap. You will need to take my word for that for now. I
> promise that the beer is on me should you not find that reproducible.
>
> The repository delay is just about not bothering Hirofumi for a merge
> while he finishes up his inode table anti-fragmentation work.

Just a suggestion, but before you do a huge post about how great your filesystem
is performing, making the code available so that others can test it when
prompted by your post is probably a very good idea. If it means that you have to
send out your post a week later, it's a very small cost for the benefit of
having other people able to easily try it on hardware that you don't have access
to.

If there is a reason to post without the code being in the main, publicised
repo, then your post should point people at what code they can use to duplicate
it.
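
For anyone wanting a rough, portable starting point in the meantime, here is a
minimal sketch of a parallel fsync timing loop. It is NOT the benchmark used for
any of the numbers quoted in this thread; the function names, the thread-based
design, and the default write count and size are all assumptions made for
illustration.

```python
import os
import tempfile
import threading
import time


def fsync_task(directory, index, writes, size):
    """Append `writes` chunks of `size` bytes to a private file, fsyncing each."""
    path = os.path.join(directory, f"task-{index}.dat")
    payload = b"x" * size
    with open(path, "wb") as f:
        for _ in range(writes):
            f.write(payload)
            f.flush()             # push Python's buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to commit it to storage


def run_benchmark(directory, tasks, writes=10, size=4096):
    """Run `tasks` concurrent writers; return elapsed wall time in seconds."""
    threads = [
        threading.Thread(target=fsync_task, args=(directory, i, writes, size))
        for i in range(tasks)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical usage: pass dir=<mount point> to TemporaryDirectory to
    # place the files on the filesystem under test.
    with tempfile.TemporaryDirectory() as d:
        for tasks in (1, 10, 100):
            print(f"{tasks:4d} tasks: {run_benchmark(d, tasks):.3f}s")
```

This uses threads rather than separate processes, and a single run per task
count, so it says nothing about run to run variation; repeating each
measurement several times would be the obvious next step.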

But really, 11 months without updating the main repo? This is Open Source
development, publish early and often.

>>> Note: you should recheck your final number for Btrfs. I have seen
>>> Btrfs fall off the rails and take wildly longer on some tests just
>>> like that.
>>
>> Completely reproducible...
>
> I believe you. I found that Btrfs does that way too much. So does XFS
> from time to time, when it gets up into lots of tasks. Read starvation
> on XFS is much worse than Btrfs, and XFS also exhibits some very
> undesirable behavior with initial file create. Note: Ext4 and Tux3 have
> roughly zero read starvation in any of these tests, which pretty much
> proves it is not just a block scheduler thing. I don't think this is
> something you should dismiss.

Something to investigate, but I have seen problems on ext* in the past. ext4 may
have fixed this, or it may just have moved the point where it triggers.

>>> I wouldn't be so sure about that...
>>>
>>>     Tasks:       8            16            32
>>>     Ext4:     93.06 MB/s    98.67 MB/s   102.16 MB/s
>>>     XFS:      81.10 MB/s    79.66 MB/s    73.27 MB/s
>>>     Btrfs:    43.77 MB/s    64.81 MB/s    90.35 MB/s ...
>>
>>      Ext4:     807.21 MB/s    1089.89 MB/s   867.55 MB/s
>>      XFS:      997.77 MB/s    1011.51 MB/s   876.49 MB/s
>>      Btrfs:     55.66 MB/s      56.77 MB/s    60.30 MB/s
>>
>> Numbers are again very different for XFS and ext4 on /dev/ramX on my
>> system. Need to work out why yours are so low....
>
> Your machine makes mine look like a PCjr.

The interesting thing here is that on the faster machine btrfs didn't speed up
significantly while ext4 and xfs did. It will be interesting to see what the
results are for tux3.
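
A minimal sketch of that comparison, using only the throughput figures from the
two quoted tables. The names `slower_machine`, `faster_machine` and `speedups`
are made up for illustration.

```python
# Throughput in MB/s at 8, 16 and 32 tasks, copied from the two quoted tables.
slower_machine = {
    "Ext4": [93.06, 98.67, 102.16],
    "XFS": [81.10, 79.66, 73.27],
    "Btrfs": [43.77, 64.81, 90.35],
}
faster_machine = {
    "Ext4": [807.21, 1089.89, 867.55],
    "XFS": [997.77, 1011.51, 876.49],
    "Btrfs": [55.66, 56.77, 60.30],
}


def speedups(slow, fast):
    """Return fast/slow throughput ratios per filesystem, one per task count."""
    return {fs: [f / s for s, f in zip(slow[fs], fast[fs])] for fs in slow}


for fs, ratios in speedups(slower_machine, faster_machine).items():
    print(f"{fs:6s}", "  ".join(f"{r:5.2f}x" for r in ratios))
```

The two machines and their backing devices differ in more than raw speed, so
these ratios only show which filesystems benefited, not why.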

Both of you need to remember that while servers are getting faster, we are
also seeing much lower power, weaker servers showing up as well. And while these
smaller servers are not trying to do the 10000 thread fsync workload, they are
using flash based storage more frequently than they are spinning rust
(frequently through the bottleneck of an SD card), so continuing tests on low end
devices is good.

>>> I said then that when we
>>> got around to a proper fsync it would be competitive. Now here it
>>> is, so you want to change the topic. I understand.
>>
>> I haven't changed the topic, just the storage medium. The simple
>> fact is that the world is moving away from slow sata storage at a
>> pretty rapid pace and it's mostly going solid state. Spinning disks
>> are also changing - they are going to ZBC based SMR, which is a
>> completely different problem space which doesn't even appear to be
>> on the tux3 radar....
>>
>> So where does tux3 fit into a storage future of byte addressable
>> persistent memory and ZBC based SMR devices?
>
> You won't convince us to abandon spinning rust, it's going to be around
> a lot longer than you think. Obviously, we care about SSD and I believe
> you will find that Tux3 is more than competitive there. We lay things
> out in a very erase block friendly way. We need to address the volume
> wrap issue of course, and that is in progress. This is much easier than
> spinning disk.
>
> Tux3's redirect-on-write[1] is obviously a natural for SMR, however
> I will not get excited about it unless a vendor waves money.

What drives are available now? See if you can get a couple (either directly or
donated).

David Lang