Re: Tux3 Report: How fast can we fsync?

From: Daniel Phillips
Date: Thu Apr 30 2015 - 06:28:06 EST


On Wednesday, April 29, 2015 6:46:16 PM PDT, Dave Chinner wrote:
I measured fsync performance using a 7200 RPM disk as a virtual
drive under KVM, configured with cache=none so that asynchronous
writes are cached and synchronous writes translate into direct
writes to the block device.

Yup, a slow single spindle, so fsync performance is determined by
seek latency of the filesystem. Hence the filesystem that "wins"
will be the filesystem that minimises fsync seek latency above all
other considerations.

http://www.spinics.net/lists/kernel/msg1978216.html

If you want to declare that XFS only works well on solid state disks and big storage arrays, that is your business. But if you do, you can no
longer call XFS a general purpose filesystem. And if you would rather disparage people who report genuine performance bugs than get down to
fixing them, that is your business too. Don't expect to be able to stop the bug reports by bluster.

So, to demonstrate, I'll run the same tests but using a 256GB
samsung 840 EVO SSD and show how much the picture changes.

I will go you one better, I ran a series of fsync tests using tmpfs,
and I now have a very clear picture of how the picture changes. The
executive summary is: Tux3 is still way faster, and still scales way
better to large numbers of tasks. I have every confidence that the same
is true of SSD.

I didn't test tux3, you don't make it easy to get or build.

There is no need to apologize for not testing Tux3. However, it is unseemly to throw mud at the same time. Remember, you are the person who put so much energy into blocking Tux3 from merging last summer. If
it now takes you a little extra work to build it, then it is hard to be really sympathetic. Mike apparently did not find it very hard.

To focus purely on fsync, I wrote a
small utility (at the end of this post) that forks a number of
tasks, each of which continuously appends to and fsyncs its own
file. For a single task doing 1,000 fsyncs of 1K each, we have:

Ext4: 34.34s
XFS: 23.63s
Btrfs: 34.84s
Tux3: 17.24s
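
For readers who want to try the same kind of measurement, here is a minimal sketch of a utility of the kind described above. It is not the C program attached to the original post. The function name, the Python rendering and the argument order (base name, fsyncs per task, number of tasks, inferred from the command lines quoted in this message) are assumptions.

```python
import os
import time


def run_sync_tasks(basename, syncs, tasks, size=1024):
    """Fork `tasks` children; each appends `size` bytes to its own file
    and fsyncs it, `syncs` times. Returns elapsed wall-clock seconds."""
    payload = b"x" * size
    start = time.monotonic()
    children = []
    for i in range(tasks):
        pid = os.fork()
        if pid == 0:
            # Child: append to a private file, fsync after every write.
            status = 1
            try:
                fd = os.open(f"{basename}{i}",
                             os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
                for _ in range(syncs):
                    os.write(fd, payload)
                    os.fsync(fd)  # block until the data is on stable storage
                os.close(fd)
                status = 0
            finally:
                os._exit(status)  # never return into the parent's code path
        children.append(pid)
    # Parent: wait for every child before stopping the clock.
    for pid in children:
        os.waitpid(pid, 0)
    return time.monotonic() - start
```

Under these assumptions, run_sync_tasks("foo", 1000, 1) corresponds to the single task case of 1,000 fsyncs of 1K each. Interpreter overhead means absolute times will not match those of a C program.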

Ext4: 1.94s
XFS: 2.06s
Btrfs: 2.06s

All equally fast, so I can't see how tux3 would be much faster here.

Running the same thing on tmpfs, Tux3 is significantly faster:

Ext4: 1.40s
XFS: 1.10s
Btrfs: 1.56s
Tux3: 1.07s

Tasks:       10      100    1,000    10,000
Ext4:     0.05s    0.12s    0.48s     3.99s
XFS:      0.25s    0.41s    0.96s     4.07s
Btrfs:    0.22s    0.50s    2.86s   161.04s
(lower is better)

Ext4 and XFS are fast and show similar performance. Tux3 *can't* be
very much faster as most of the elapsed time in the test is from
forking the processes that do the IO and fsyncs.

You wish. In fact, Tux3 is a lot faster. You must have made a mistake in estimating your fork overhead. It is easy to check: just run "syncs foo 0 10000". I get 0.23 seconds to fork 10,000 processes, create the files and exit. Here are my results on tmpfs, triple checked and reproducible:

Tasks:      10     100   1,000   10,000
Ext4:     0.05    0.14    1.53    26.56
XFS:      0.05    0.16    2.10    29.76
Btrfs:    0.08    0.37    3.18    34.54
Tux3:     0.02    0.05    0.18     2.16
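
The fork overhead check described above (run the test with zero fsyncs, then compare against a full run) can be sketched as follows. This is a hedged illustration in Python, not the utility from the original post. The names fork_overhead and fsync_share are hypothetical.

```python
import os
import time


def fork_overhead(basename, tasks):
    """Time forking `tasks` children that each create a file and exit,
    i.e. the zero-fsync case."""
    start = time.monotonic()
    pids = []
    for i in range(tasks):
        pid = os.fork()
        if pid == 0:
            status = 1
            try:
                fd = os.open(f"{basename}{i}", os.O_WRONLY | os.O_CREAT, 0o644)
                os.close(fd)
                status = 0
            finally:
                os._exit(status)  # child must not continue the parent's loop
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    return time.monotonic() - start


def fsync_share(total_seconds, overhead_seconds):
    """Fraction of a run's elapsed time not explained by fork/create overhead."""
    return (total_seconds - overhead_seconds) / total_seconds
```

With the figures reported in this message (0.23 seconds of overhead against 2.16 seconds for Tux3 at 10,000 tasks), fsync_share(2.16, 0.23) comes to roughly 0.89, so on those numbers most of the elapsed time is not fork overhead.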

Note: you should recheck your final number for Btrfs. I have seen Btrfs fall off the rails and take wildly longer on some tests just like that.
We know Btrfs has corner case issues, I don't think they deny it. Unlike you, Chris Mason is a gentleman when faced with issues. Instead of insulting his colleagues and hurling around the sort of abuse that has gained LKML its current unenviable reputation, he gets down to work and fixes things.

You should do that too, your own house is not in order. XFS has major issues. One easily reproducible one is a denial of service during the 10,000 task test where it takes multiple seconds to cat small files. I saw XFS do this on both spinning disk and tmpfs, and I have seen it hang for minutes trying to list a directory. I looked a bit into it, and I see that you are blocking for aeons trying to acquire a lock in open.

Here is an example. While doing "sync6 fs/foo 10 10000":

time cat fs/foo999
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!
hello world!

real 0m2.282s
user 0m0.000s
sys 0m0.000s
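
The same read latency can be sampled repeatedly while the benchmark runs, rather than with a single "time cat". The following is a minimal sketch. The function names, the sample count and the pause are assumptions chosen for illustration.

```python
import time


def timed_read(path):
    """Return (seconds, data) for opening and reading a small file once."""
    start = time.monotonic()
    with open(path, "rb") as f:
        data = f.read()
    return time.monotonic() - start, data


def worst_read_latency(path, samples=20, pause=0.1):
    """Read `path` `samples` times and return the slowest open+read time."""
    worst = 0.0
    for _ in range(samples):
        elapsed, _data = timed_read(path)
        worst = max(worst, elapsed)
        time.sleep(pause)
    return worst
```

Calling worst_read_latency("fs/foo999") from a second terminal during the 10,000 task run would show whether multi-second stalls like the one above recur.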

You and I both know the truth: Ext4 is the only really reliable general purpose filesystem on Linux at the moment. XFS is definitely not, I have seen ample evidence with my own eyes. What you need is people helping you fix your issues instead of making your colleagues angry at you with your incessant attacks.

FWIW, btrfs shows its horrible fsync implementation here, burning
huge amounts of CPU to do bugger all IO. i.e. it burnt all 16p for 2
and a half minutes in that 10000 fork test so wasn't IO bound at
all.

Btrfs is hot and cold. In my tmpfs tests, Btrfs beats XFS at high task counts. It is actually amazing the progress Btrfs has made in performance. I for one appreciate the work they are doing and I admire the way Chris conducts both himself and his project. I wish you were more like Chris, and I wish I was for that matter.

I agree that Btrfs uses too much CPU, but there is no need to be rude about it. I think the Btrfs team knows how to use a profiler.

Is there any practical use for fast parallel fsync of tens of thousands
of tasks? This could be useful for a scalable transaction server
that sits directly on the filesystem instead of a database, as is
the fashion for big data these days. It certainly can't hurt to know
that if you need that kind of scaling, Tux3 will do it.

Ext4 and XFS already do that just fine, too, when you use storage
suited to such a workload and you have a sane interface for
submitting tens of thousands of concurrent fsync operations. e.g

http://oss.sgi.com/archives/xfs/2014-06/msg00214.html

Tux3 turns in really great performance with an ordinary, cheap spinning disk using standard Posix ops. It is not for you to tell people they don't care about that, and it is wrong for you to imply that we only perform well on spinning disk - you don't know that, and it's not true.

By the way, I like your asynchronous fsync, nice work. It by no means
obviates the need for a fast implementation of the standard operation.

On a SSD (256GB samsung 840 EVO), running 4.0.0:

Tasks:               8            16            32
Ext4:      598.27 MB/s   981.13 MB/s  1233.77 MB/s
XFS:       884.62 MB/s  1328.21 MB/s  1373.66 MB/s
Btrfs:     201.64 MB/s   137.55 MB/s   108.56 MB/s

dbench looks *very different* when there is no seek latency,
doesn't it?

It looks like Btrfs hit a bug, not a huge surprise. Btrfs hit an assert
for me earlier this evening. It is rare but it happens. I rebooted and got sane numbers. Running dbench -t10 on tmpfs I get:

Tasks:               8            16            32
Ext4:      660.69 MB/s   708.81 MB/s   720.12 MB/s
XFS:       692.01 MB/s   388.53 MB/s   134.84 MB/s
Btrfs:     229.66 MB/s   341.27 MB/s   377.97 MB/s
Tux3:     1147.12 MB/s  1401.61 MB/s  1283.74 MB/s

Looks like XFS hit a bump and fell off the cliff at 32 threads. I reran
that one many times because I don't want to give you an inaccurate report.

Tux3 turned in a great performance. I am not pleased with the negative scaling at 32 threads, but it still finishes way ahead.
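
To make the scaling behaviour explicit, the step-to-step throughput ratios can be computed directly from the tmpfs dbench numbers above. This is a small illustrative sketch. The helper name scaling is hypothetical, and the figures are copied from the table, not remeasured.

```python
# Throughput in MB/s at 8, 16 and 32 tasks, copied from the tmpfs table above.
results = {
    "Ext4": (660.69, 708.81, 720.12),
    "XFS": (692.01, 388.53, 134.84),
    "Btrfs": (229.66, 341.27, 377.97),
    "Tux3": (1147.12, 1401.61, 1283.74),
}


def scaling(values):
    """Ratio of each step's throughput to the previous step's."""
    return [round(b / a, 2) for a, b in zip(values, values[1:])]


for name, values in results.items():
    print(f"{name:6s} {scaling(values)}")
```

A ratio below 1.0 means throughput dropped when the task count doubled. On these numbers XFS drops at both steps (0.56, then 0.35), while Tux3 gains from 8 to 16 tasks (1.22) and loses a little from 16 to 32 (0.92).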

Dbench -t10 -s (all file operations synchronous)

Tasks:              8           16           32
Ext4:       4.51 MB/s    6.25 MB/s    7.72 MB/s
XFS:        4.24 MB/s    4.77 MB/s    5.15 MB/s
Btrfs:      7.98 MB/s   13.87 MB/s   22.87 MB/s
Tux3:      15.41 MB/s   25.56 MB/s   39.15 MB/s
(higher is better)

Ext4:     173.54 MB/s  294.41 MB/s  424.11 MB/s
XFS:      172.98 MB/s  342.78 MB/s  458.87 MB/s
Btrfs:     36.92 MB/s   34.52 MB/s   55.19 MB/s

Again, the numbers are completely the other way around on a SSD,
with the conventional filesystems being 5-10x faster than the
WA/COW style filesystem.

I wouldn't be so sure about that...

Tasks:              8           16           32
Ext4:      93.06 MB/s   98.67 MB/s  102.16 MB/s
XFS:       81.10 MB/s   79.66 MB/s   73.27 MB/s
Btrfs:     43.77 MB/s   64.81 MB/s   90.35 MB/s
Tux3:     198.49 MB/s  279.00 MB/s  318.41 MB/s

In the full disclosure department, Tux3 is still not properly
optimized in some areas. One of them is fragmentation: it is not
very hard to make Tux3 slow down by running long tests. Our current

Oh, that still hasn't been fixed?

Count your blessings while you can.

Until you sort out how you are going to scale allocation to tens of
TB and not fragment free space over time, fsync performance of the
filesystem is pretty much irrelevant. Changing the allocation
algorithms will fundamentally alter the IO patterns and so all these
benchmarks are essentially meaningless.

Ahem, are you the same person for whom fsync was the most important issue in the world last time the topic came up, to the extent of spreading around FUD and entirely ignoring the great work we had accomplished for regular file operations? I said then that when we got around to a proper fsync it would be competitive. Now here it is, so you want to change the topic. I understand.

Honestly, you would be a lot better off investigating why our fsync algorithm is so good.

Regards,

Daniel