Re: xfs: does mkfs.xfs require fancy switches to get decent performance? (was Tux3 Report: How fast can we fsync?)

From: Martin Steigerwald
Date: Thu Apr 30 2015 - 05:00:22 EST

On Thursday, 30 April 2015, 10:20:08, Dave Chinner wrote:
> On Wed, Apr 29, 2015 at 09:05:26PM +0200, Mike Galbraith wrote:
> > Here's something that _might_ interest xfs folks.
> >
> > cd git (source repository of git itself)
> > make clean
> > echo 3 > /proc/sys/vm/drop_caches
> > time make -j8 test
> >
> > ext4 2m20.721s
> > xfs 6m41.887s <-- ick
> > btrfs 1m32.038s
> > tux3 1m30.262s
> >
> > Testing by Aunt Tilly: mkfs, no fancy switches, mount the thing, test.
> TL;DR: Results are *very different* on a 256GB Samsung 840 EVO SSD
> with slightly slower CPUs (E5-4620 @ 2.20GHz), all filesystems
> using defaults:
>          real        user        sys
> xfs      3m16.138s   7m8.341s    14m32.462s
> ext4     3m18.045s   7m7.840s    14m32.994s
> btrfs    3m45.149s   7m10.184s   16m30.498s
> What you are seeing is physical seek distances impacting read
> performance. XFS does not optimise for minimal physical seek
> distance, and hence is slower than filesystems that do optimise for
> minimal seek distance. This shows up especially well on slow single
> spindles.
> XFS is *adequate* for the use on slow single drives, but it is
> really designed for best performance on storage hardware that is not
> seek distance sensitive.
> IOW, XFS just hates your disk. Spend $50 and buy a cheap SSD and
> the problem goes away. :)

I am quite surprised that a traditional filesystem that was created in the
age of rotating media does not like this kind of media, and even seems to
outperform BTRFS on the new non-rotating media available.


> ----
> And now in more detail.
> It's easy to be fast on empty filesystems. XFS does not aim to be
> fast in such situations - it aims to have consistent performance
> across the life of the filesystem.

This is quite an important addition.

> Thing is, once you've abused those filesystems for a couple of
> months, the files in ext4, btrfs and tux3 are not going to be laid
> out perfectly on the outer edge of the disk. They'll be spread all
> over the place and so all the filesystems will be seeing large seeks
> on read. The thing is, XFS will have roughly the same performance as
> when the filesystem is empty because the spreading of the allocation
> allows it to maintain better locality and separation and hence
> doesn't fragment free space nearly as badly as the other filesystems.
> Free space fragmentation is what leads to performance degradation in
> filesystems, and all the other filesystems will have degraded to be
> *much worse* than XFS.

I still even see hangs caused by what I take to be free space fragmentation
in BTRFS. My /home on a dual (!) SSD BTRFS setup can basically grind to a
halt once it has reserved all space on the devices for chunks. So this

merkaba:~> btrfs fi sh /home
Label: 'home' uuid: […]
Total devices 2 FS bytes used 129.48GiB
devid 1 size 170.00GiB used 146.03GiB path /dev/mapper/msata-
devid 2 size 170.00GiB used 146.03GiB path /dev/mapper/sata-

Btrfs v3.18
merkaba:~> btrfs fi df /home
Data, RAID1: total=142.00GiB, used=126.72GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=4.00GiB, used=2.76GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

is safe, but once I have size 170 GiB, used 170 GiB, then even if there is
enough free space inside the chunks to allocate from, enough as in 30-40
GiB, writes can stall to the point that applications on the desktop freeze
and I see hung task messages in the kernel log.

This is the case up to kernel 4.0. I have seen Chris Mason fix some write
stalls for big Facebook setups, and maybe that will help here, but until
this issue is fixed I think BTRFS is not yet fully production ready, unless
you leave a *huge* amount of free space, as in: for 200 GiB of data you
want to write, make a 400 GiB volume.
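
The arithmetic behind the two listings above can be sketched as follows.
This is only an illustration in Python: the figures are copied from the
`btrfs fi show` and `btrfs fi df` output, while the function names and the
1 GiB margin are assumptions made for the example, not anything btrfs
itself reports.

```python
# Figures in GiB, copied from the `btrfs fi show` / `btrfs fi df` output above.
DEVICE_SIZE = 170.00       # "size" per device
DEVICE_ALLOCATED = 146.03  # "used" per device = space reserved for chunks
DATA_TOTAL = 142.00        # Data, RAID1: total (space inside data chunks)
DATA_USED = 126.72         # Data, RAID1: used


def unallocated(size, allocated):
    """Space on a device that has not yet been reserved for any chunk."""
    return round(size - allocated, 2)


def free_in_chunks(total, used):
    """Space that is still free inside already allocated chunks."""
    return round(total - used, 2)


def fully_allocated(size, allocated, margin=1.0):
    """True when less than `margin` GiB is left for new chunks.

    The 1 GiB default is a hypothetical threshold chosen for this sketch.
    """
    return unallocated(size, allocated) < margin


print("unallocated per device:", unallocated(DEVICE_SIZE, DEVICE_ALLOCATED))
print("free inside data chunks:", free_in_chunks(DATA_TOTAL, DATA_USED))
print("fully allocated now:", fully_allocated(DEVICE_SIZE, DEVICE_ALLOCATED))
print("fully allocated at 170/170:", fully_allocated(170.00, 170.00))
```

The point of keeping the two numbers apart is that the stalls described
above show up when the first one reaches zero, even while the second one
is still large.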

> Put simply: empty filesystem benchmarking does not show the real
> performance of the filesystem under sustained production workloads.
> Hence benchmarks like this - while interesting from a theoretical
> point of view and widely used for bragging about who's got the
> fastest - are mostly irrelevant to determining how the filesystem
> will perform in production environments.
> We can also look at this algorithm in a different way: take a large
> filesystem (say a few hundred TB) across a few tens of disks in a
> linear concat. ext4, btrfs and tux3 will only hit the first disk in
> the concat, and so go no faster because they are still bound by
> physical seek times. XFS, however, will spread the load across many
> (if not all) of the disks, and so effectively reduce the average
> seek time by the number of disks doing concurrent IO. Then you'll
> see that application level IO concurrency becomes the performance
> limitation, not the physical seek time of the hardware.

Those are the allocation groups. I always wondered how it could be
beneficial to spread the allocations over 4 areas of one partition on media
where seeks are expensive. Now that makes more sense to me. I always had
the gut feeling that XFS may not be the fastest in all cases, but that it
is one of the filesystems with the most consistent performance over time;
I was just never able to fully explain why.
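
A toy model may help to picture the linear concat case Dave describes.
The sketch below is Python and deliberately simplified: the disk size,
the allocation group count and the round-robin placement are assumptions
made for illustration and are not XFS's actual allocator. It only shows
that allocations packed at the start of the address space land on the
first disk, while allocations spread over allocation groups reach every
disk in the concat.

```python
from collections import Counter


def disk_for_offset(offset, disk_size):
    """In a linear concat, disk N holds offsets [N*disk_size, (N+1)*disk_size)."""
    return offset // disk_size


def packed_offsets(n_files, file_size):
    """Pack files back to back from the start of the address space."""
    return [i * file_size for i in range(n_files)]


def ag_spread_offsets(n_files, file_size, ag_count, ag_size):
    """Place files round-robin across allocation groups (a simplification)."""
    next_free = [0] * ag_count  # next free offset inside each group
    offsets = []
    for i in range(n_files):
        ag = i % ag_count
        offsets.append(ag * ag_size + next_free[ag])
        next_free[ag] += file_size
    return offsets


def files_per_disk(offsets, disk_size):
    """Count how many files start on each disk of the concat."""
    return Counter(disk_for_offset(o, disk_size) for o in offsets)


# Hypothetical layout: 4 disks of 1000 units each, split into 8 groups.
DISK_SIZE = 1000
DISK_COUNT = 4
AG_COUNT = 8
AG_SIZE = DISK_SIZE * DISK_COUNT // AG_COUNT

packed = files_per_disk(packed_offsets(16, 10), DISK_SIZE)
spread = files_per_disk(ag_spread_offsets(16, 10, AG_COUNT, AG_SIZE), DISK_SIZE)

print("packed:", dict(packed))
print("spread:", dict(spread))
```

With all files on one disk, concurrent readers queue behind a single
spindle; with the spread layout, each disk serves only its share.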

Martin 'Helios' Steigerwald
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7