Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

From: Dave Chinner
Date: Wed Apr 29 2015 - 20:20:34 EST


On Wed, Apr 29, 2015 at 09:05:26PM +0200, Mike Galbraith wrote:
> Here's something that _might_ interest xfs folks.
>
> cd git (source repository of git itself)
> make clean
> echo 3 > /proc/sys/vm/drop_caches
> time make -j8 test
>
> ext4 2m20.721s
> xfs 6m41.887s <-- ick
> btrfs 1m32.038s
> tux3 1m30.262s
>
> Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.

TL;DR: Results are *very different* on a 256GB Samsung 840 EVO SSD
with slightly slower CPUs (E5-4620 @ 2.20GHz), all filesystems
using defaults:

        real        user        sys
xfs     3m16.138s   7m8.341s    14m32.462s
ext4    3m18.045s   7m7.840s    14m32.994s
btrfs   3m45.149s   7m10.184s   16m30.498s

What you are seeing is physical seek distances impacting read
performance. XFS does not optimise for minimal physical seek
distance, and hence is slower than filesystems that do optimise for
minimal seek distance. This shows up especially well on slow single
spindles.

XFS is *adequate* for use on slow single drives, but it is
really designed for best performance on storage hardware that is not
seek distance sensitive.

IOWs, XFS just hates your disk. Spend $50 and buy a cheap SSD and
the problem goes away. :)

----

And now in more detail.

It's easy to be fast on empty filesystems. XFS does not aim to be
fast in such situations - it aims to have consistent performance
across the life of the filesystem.

In this case, ext4, btrfs and tux3 have optimal allocation filling
from the outside of the disk, while XFS is spreading the files
across (at least) 4 separate regions of the whole disk. Hence XFS is
seeing much larger seek times on read than the other filesystems
when the filesystem is empty, as it is doing full disk seeks rather
than being confined to the outer edge of the spindle.
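
As a rough illustration of the seek distance argument, here is a toy
model in Python. It is only a sketch: the disk is a unit interval with
offset 0 standing in for the outer edge, the 5% fill level and the
random choice of region are assumptions made for the example, and
neither function is the real allocator of any of these filesystems.

```python
import random

def mean_seek(offsets):
    """Average head movement between consecutive reads, as a fraction of the disk."""
    steps = [abs(b - a) for a, b in zip(offsets, offsets[1:])]
    return sum(steps) / len(steps)

def packed_layout(n_files, rng, used=0.05):
    """Assumed model: every file lands in one small region at the start of the disk."""
    return [rng.uniform(0.0, used) for _ in range(n_files)]

def spread_layout(n_files, rng, regions=4, used=0.05):
    """Assumed model: each file lands in one of `regions` equal slices of the
    disk, and each slice fills from its own start."""
    width = 1.0 / regions
    fill = used / regions
    return [rng.randrange(regions) * width + rng.uniform(0.0, fill)
            for _ in range(n_files)]

rng = random.Random(0)
packed = mean_seek(packed_layout(10_000, rng))
spread = mean_seek(spread_layout(10_000, rng))
print(f"packed: {packed:.4f} of the disk per seek")
print(f"spread: {spread:.4f} of the disk per seek")
```

The same amount of data is stored in both layouts; only the distance
the head has to travel between reads differs.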

Thing is, once you've abused those filesystems for a couple of
months, the files in ext4, btrfs and tux3 are not going to be laid
out perfectly on the outer edge of the disk. They'll be spread all
over the place and so all the filesystems will be seeing large seeks
on read. XFS, however, will have roughly the same performance as
when the filesystem is empty because the spreading of the allocation
allows it to maintain better locality and separation and hence
doesn't fragment free space nearly as badly as the other filesystems.
Free space fragmentation is what leads to performance degradation in
filesystems, and all the other filesystems will have degraded to be
*much worse* than XFS.
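
To make "free space fragmentation" concrete, here is a minimal Python
sketch that lists the free extents in an allocation map. The map
strings are made up for the example ('#' is a used block, '.' is a
free block) and do not come from any real filesystem.

```python
from itertools import groupby

def free_extents(bitmap):
    """Lengths of each contiguous run of free blocks ('.') in an allocation map."""
    return [len(list(run))
            for is_free, run in groupby(bitmap, key=lambda block: block == ".")
            if is_free]

# Two hypothetical 16-block maps, both with 12 free blocks.
fresh = "####............"
aged = "#..#...#..#....."

for name, bitmap in (("fresh", fresh), ("aged", aged)):
    extents = free_extents(bitmap)
    print(f"{name}: {sum(extents)} free blocks in {len(extents)} extents, "
          f"largest {max(extents)}")
```

Both maps have the same amount of free space, but a new 8-block file
fits contiguously only in the first one; in the second it has to be
split across several extents, which means extra seeks on every read.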

Put simply: empty filesystem benchmarking does not show the real
performance of the filesystem under sustained production workloads.
Hence benchmarks like this - while interesting from a theoretical
point of view and widely used for bragging about who's got the
fastest - are mostly irrelevant to determining how the filesystem
will perform in production environments.

We can also look at this algorithm in a different way: take a large
filesystem (say a few hundred TB) across a few tens of disks in a
linear concat. ext4, btrfs and tux3 will only hit the first disk in
the concat, and so go no faster because they are still bound by
physical seek times. XFS, however, will spread the load across many
(if not all) of the disks, and so effectively reduce the average
seek time by the number of disks doing concurrent IO. Then you'll
see that application level IO concurrency becomes the performance
limitation, not the physical seek time of the hardware.
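
The concat case can be sketched the same way. The Python below is a
toy model under stated assumptions: the disk size, disk count, file
size and the round-robin placement across 40 regions are all
hypothetical values picked for illustration, not the behaviour of any
real allocator.

```python
def disk_for_offset(offset, disk_size):
    """In a linear concat the disks sit end to end in one address space."""
    return offset // disk_size

def disks_touched(offsets, disk_size):
    """Number of distinct disks that hold at least one of the given offsets."""
    return len({disk_for_offset(o, disk_size) for o in offsets})

# Hypothetical setup: 20 disks of 10 TiB each, 1000 files of 1 GiB.
DISK_SIZE = 10 * 2**40
N_DISKS = 20
TOTAL = DISK_SIZE * N_DISKS
FILE_SIZE = 2**30
N_FILES = 1000

# Assumed front-packed layout: files placed back to back from offset 0.
packed = [i * FILE_SIZE for i in range(N_FILES)]

# Assumed spread layout: files dealt round-robin across 40 equal regions,
# each region filling from its own start.
N_REGIONS = 40
REGION_SIZE = TOTAL // N_REGIONS
spread = [(i % N_REGIONS) * REGION_SIZE + (i // N_REGIONS) * FILE_SIZE
          for i in range(N_FILES)]

print("packed layout touches", disks_touched(packed, DISK_SIZE), "disk(s)")
print("spread layout touches", disks_touched(spread, DISK_SIZE), "disk(s)")
```

With these numbers the packed layout stays on the first disk, while
the spread layout puts data on every disk in the concat, so reads can
be serviced by many spindles at once.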

IOWs, what you don't see here is that the XFS algorithms that make
your test slow will keep *lots* of disks busy. i.e. testing empty
filesystem performance on a single, slow disk demonstrates that an
algorithm designed for scalability isn't designed to achieve
physical seek distance minimisation. Hence your storage makes XFS
look particularly poor in comparison to filesystems that are being
designed and optimised for the limitations of single slow spindles...

To further demonstrate that it is physical seek distance that is the
issue here, let's take the seek time out of the equation (e.g. use an
SSD). Doing that will result in basically no difference in
performance between all 4 filesystems as performance will now be
determined by application level concurrency and that is the same for
all tests.

e.g. on a 16p, 16GB RAM VM with storage on SSDs, a "make -j 8"
compile test on a kernel source tree (using my normal test machine
.config) gives:

        real        user         sys
xfs:    4m6.723s    26m21.087s   2m49.426s
ext4:   4m11.415s   26m21.122s   2m49.786s
btrfs:  4m8.118s    26m26.440s   2m50.357s

i.e. take seek times out of the picture, and XFS is just as fast as
any of the other filesystems.
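
For anyone who wants the gap as a number, here is a small Python
sketch that converts the "real" times quoted in this thread to seconds
and reports how far each filesystem is behind the fastest one. The
helper name to_seconds is made up for this example; the timings are
copied from the spinning disk report and the kernel compile table
above, which are different workloads on different machines.

```python
import re

def to_seconds(stamp):
    """Convert a time(1)-style 'XmY.YYYs' string to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

def slowdowns(real_times):
    """Map each filesystem to its real time divided by the fastest real time."""
    secs = {fs: to_seconds(t) for fs, t in real_times.items()}
    fastest = min(secs.values())
    return {fs: s / fastest for fs, s in secs.items()}

# "make -j8 test" on git, spinning disk (from the quoted report).
spinning_disk = {"ext4": "2m20.721s", "xfs": "6m41.887s",
                 "btrfs": "1m32.038s", "tux3": "1m30.262s"}
# "make -j 8" kernel compile, storage on SSDs (from the table above).
ssd = {"xfs": "4m6.723s", "ext4": "4m11.415s", "btrfs": "4m8.118s"}

for label, table in (("spinning disk", spinning_disk), ("ssd", ssd)):
    for fs, ratio in sorted(slowdowns(table).items(), key=lambda kv: kv[1]):
        print(f"{label:14s} {fs:6s} {ratio:.2f}x the fastest")
```

On the spinning disk the slowest filesystem takes more than four times
as long as the fastest; on SSDs the whole field is within about two
percent.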

Just about everyone I know uses SSDs in their laptops and machines
that build kernels these days, and spinning disks are rapidly
disappearing from enterprise and HPC environments, which also happen
to be the target markets for XFS. Hence filesystem performance on
slow single spindles is the furthest thing away from what we really
need to optimise XFS for.

Indeed, I'll point you to where we are going with fsync optimisation
- it's completely the other end of the scale:

http://oss.sgi.com/archives/xfs/2014-06/msg00214.html

i.e. being able to scale effectively to tens of thousands of fsync
calls every second because that's what applications like ceph and
gluster really need from XFS....
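
To get a feel for fsync rates on a given filesystem, a minimal Python
probe could look like the sketch below. It is an assumption-laden
illustration: the function name fsync_rate, the 4096 byte payload and
the call count are arbitrary, and a single-threaded loop like this
measures fsync latency on one file, not the concurrent fsync scaling
that the link above is about.

```python
import os
import tempfile
import time

def fsync_rate(directory, count=100, payload=b"x" * 4096):
    """Append `payload` and fsync `count` times; return fsync calls per second."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)  # block until the data is reported stable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / elapsed

# Point this at a directory on the filesystem under test.
rate = fsync_rate(tempfile.gettempdir())
print(f"{rate:.0f} fsync calls per second")
```

Run it against a directory on each filesystem and each device; the
result depends heavily on the storage hardware underneath.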

> Are defaults for mkfs.xfs such that nobody sane uses them, or does xfs
> really hate whatever git selftests are doing this much?

It just hates your disk. Spend $50 and buy a cheap SSD and the
problem goes away. :)

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx