xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

Thu Apr 30 04:14:54 PDT 2015

On Wednesday, April 29, 2015 5:20:08 PM PDT, Dave Chinner wrote:
> It's easy to be fast on empty filesystems. XFS does not aim to be
> fast in such situations - it aims to have consistent performance
> across the life of the filesystem.
>
> In this case, ext4, btrfs and tux3 have optimal allocation filling
> from the outside of the disk, while XFS is spreading the files
> across (at least) 4 separate regions of the whole disk. Hence XFS is
> seeing seek times on read that are much larger than the other
> filesystems when the filesystem is empty as it is doing full disk
> seeks rather than being confined to the outer edges of the spindle.
>
> Thing is, once you've abused those filesystems for a couple of
> months, the files in ext4, btrfs and tux3 are not going to be laid
> out perfectly on the outer edge of the disk. They'll be spread all
> over the place and so all the filesystems will be seeing large seeks
> on read. The thing is, XFS will have roughly the same performance as
> when the filesystem is empty because the spreading of the allocation
> allows it to maintain better locality and separation and hence
> doesn't fragment free space nearly as badly as the other filesystems.
> Free space fragmentation is what leads to performance degradation in
> filesystems, and all the other filesystems will have degraded to be
> *much worse* than XFS.
>
> Put simply: empty filesystem benchmarking does not show the real
> performance of the filesystem under sustained production workloads.
> Hence benchmarks like this - while interesting from a theoretical
> point of view and widely used for bragging about who's got the
> fastest - are mostly irrelevant to determining how the filesystem
> will perform in production environments.
>
> We can also look at this algorithm in a different way: take a large
> filesystem (say a few hundred TB) across a few tens of disks in a
> linear concat.  ext4, btrfs and tux3 will only hit the first disk in
> the concat, and so go no faster because they are still bound by
> physical seek times.  XFS, however, will spread the load across many
> (if not all) of the disks, and so effectively reduce the average
> seek time by the number of disks doing concurrent IO. Then you'll
> see that application level IO concurrency becomes the performance
> limitation, not the physical seek time of the hardware.
>
> IOWs, what you don't see here is that the XFS algorithms that make
> your test slow will keep *lots* of disks busy. i.e. testing empty
> filesystem performance on a single, slow disk demonstrates that an
> algorithm designed for scalability isn't designed to achieve
> physical seek distance minimisation.  Hence your storage makes XFS
> look particularly poor in comparison to filesystems that are being
> designed and optimised for the limitations of single slow spindles...
>
> To further demonstrate that it is physical seek distance that is the
> issue here, let's take the seek time out of the equation (e.g. use an
> SSD).  Doing that will result in basically no difference in
> performance between all 4 filesystems as performance will now be
> determined by application level concurrency and that is the same for
> all tests.

Lovely sounding argument, but it is wrong because Tux3 still beats XFS
even with seek time factored out of the equation.

Even with SSD, if you just go splattering files all over the disk you
will pay for it in latency and lifetime when the disk goes into
continuous erase and your messy layout causes write multiplication.
But of course you can design your filesystem any way you want. Tux3
is designed to be fast on the hardware that people actually have.

Regards,

Daniel