Tux3 Report: How fast can we fsync?

From: Daniel Phillips
Date: Tue Apr 28 2015 - 19:55:10 EST


Greetings,

This post is dedicated to Ted, who raised doubts a while back about whether Tux3 can ever have a fast fsync:

https://lkml.org/lkml/2013/5/11/128
"Re: Tux3 Report: Faster than tmpfs, what?"

Ted suggested that Tux3's inherently asynchronous design would be a limitation when it comes to synchronous operations like fsync. As he put it, "any advantage of decoupling the front/back end is nullified" because of temporal coupling. We found the opposite to be true: our asynchronous design works as well for synchronous operations as it does for any other, and Tux3 now has a high performance fsync to prove it.

Until now, Tux3 handled fsync as a full delta commit equivalent to syncfs, just like Ext3. This is semantically correct but sometimes commits more data than necessary and creates a bottleneck by serializing all fsyncs. To optimize it, we added a mechanism for committing any subset of dirty inodes separately from a full delta.

Like our normal delta commit, the design is asynchronous: front end tasks produce fsync work and a backend task commits it. We now have two backends, one to commit fsyncs and another to commit deltas, serialized so the combination is single threaded and mostly lockless. Each fsync moves an inode's dirty blocks from the front delta to the back delta, then queues the inode for the backend. The backend spins away committing fsyncs, opportunistically batching them up into group commits. The fsync backend gives priority to the delta backend: whenever a full delta flush is needed because of cache pressure or cache age, it finishes its fsync work and gets out of the way.

This approach required only minimal changes to our existing delta commit mechanism, mainly to support crash consistency. In particular, our delta model did not need to be changed at all, and block forking protects fsync commits in the same way as normal delta commits. That is, an inode can be freely updated (even via mmap) while it is being fsynced. The fsync data is isolated from those changes and the frontend does not stall.

I measured fsync performance using a 7200 RPM disk as a virtual drive under KVM, configured with cache=none so that asynchronous writes are cached and synchronous writes translate into direct writes to the block device. To focus purely on fsync, I wrote a small utility (at the end of this post) that forks a number of tasks, each of which continuously appends to and fsyncs its own file. For a single task doing 1,000 fsyncs of 1K each, we have:

Ext4: 34.34s
XFS: 23.63s
Btrfs: 34.84s
Tux3: 17.24s

Tux3 has a nice advantage for isolated fsyncs. This is mainly due to writing a small number of blocks per commit, currently just five blocks for a small file. As for a normal delta commit, we do not update bitmap blocks or index nodes, but log logical changes instead, flushing out the primary metadata with occasional "unify" deltas to control the log size and keep replay fast. The 1,000 fsync test typically does about three unifies, so we do in fact update all our metadata and pay that cost, just not on every commit.

The win is bigger than it appears at first glance, because writing the commit block takes about 16.5 ms. That would be similar for all the filesystems tested, so factoring that out, we see that Tux3 must be doing 10 times less work than XFS, 24 times less than Ext4 and 25 times less than Btrfs. Over time, that gap should widen as we reduce our small file commit overhead towards just two blocks.

In this single threaded test, we pay the full price for "communication delays" with our asynchronous backend, which evidently does not amount to much. Given that a context switch is on the order of microseconds while writing the commit block is on the order of milliseconds, it is unsurprising that two extra context switches just disappear in the noise.

Things get more interesting with parallel fsyncs. In this test, each task does ten fsyncs and task count scales from ten to ten thousand. We see that all tested filesystems are able to combine fsyncs into group commits, with varying degrees of success:

Tasks: 10 100 1,000 10,000
Ext4: 0.79s 0.98s 4.62s 61.45s
XFS: 0.75s 1.68s 20.97s 238.23s
Btrfs 0.53s 0.78s 3.80s 84.34s
Tux3: 0.27s 0.34s 1.00s 6.86s
(lower is better)

We expect sub-linear scaling with respect to tasks as opportunities to combine commits increase, then linear scaling as total write traffic begins to dominate. Tux3 still shows sub-linear scaling even at 10,000 tasks. XFS scales poorly, and also suffers from read starvation at the high end, sometimes taking tens of seconds to cat a file or minutes to list a directory. Ext4 and Tux3 exhibit no such issues, remaining completely responsive under all tested loads. The bottom line for this test is, Tux3 is twice as fast at fsync with a modest task count, and the gap widens to nine times faster than its nearest competitor as task count increases.

Is there any practical use for fast parallel fsync of tens of thousands
of tasks? This could be useful for a scalable transaction server that sits directly on the filesystem instead of a database, as is the fashion for big data these days. It certainly can't hurt to know that if you need that kind of scaling, Tux3 will do it.

Of course, a pure fsync load could be viewed as somewhat unnatural. We
also need to know what happens under a realistic load with buffered
operations mixed with fsyncs. We turn to an old friend, dbench:

Dbench -t10

Tasks: 8 16 32
Ext4: 35.32 MB/s 34.08 MB/s 39.71 MB/s
XFS: 32.12 MB/s 25.08 MB/s 30.12 MB/s
Btrfs: 54.40 MB/s 75.09 MB/s 102.81 MB/s
Tux3: 85.82 MB/s 133.69 MB/s 159.78 MB/s
(higher is better)

Tux3 and Btrfs scale well and are way ahead of Ext4 and XFS, while Ext4 and XFS scale poorly or even negatively. Tux3 is the leader by a wide
margin, beating XFS by more than a factor of 5 at the high end.
Dbench -t10 -s (all file operations synchronous)

Tasks: 8 16 32
Ext4: 4.51 MB/s 6.25 MB/s 7.72 MB/s
XFS: 4.24 MB/s 4.77 MB/s 5.15 MB/s
Btrfs: 7.98 MB/s 13.87 MB/s 22.87 MB/s
Tux3: 15.41 MB/s 25.56 MB/s 39.15 MB/s
(higher is better)

With a pure synchronous load (O_SYNC) the ranking is not changed but the gaps widen, and Tux3 outperforms XFS by a factor of 7.5.

At the risk of overgeneralizing, a trend seems to be emerging: the new, write-anywhere designs run synchronous operations faster, combine synchronous and asynchronous operations more efficiently, and scale better to high task counts than the traditional journaling designs. If there is to be an Ext5, it might be worth considering the merit of abandoning the journal in favor of something along the lines of Tux3's redirect-on-write and logging combination.

Getting back to Ted's question, perhaps an asynchronous design really is a better idea all round, even for synchronous operations, and perhaps there really is such a thing as an improved design that is not just a different set of tradeoffs.

In the full disclosure department, Tux3 is still not properly optimized in some areas. One of them is fragmentation: it is not very hard to make Tux3 slow down by running long tests. Our current allocation algorithm is completely naive - it just allocates the next available block and wraps at the top of volume. After a few wraps, it makes a big mess. So today we are not claiming victory in the benchmark department, we still have some work to do. Today is just about fsync, for which it is fair to say that Tux3 sets a new performance standard.

Regards,

Daniel

Footnote: While I was doing this work, Hirofumi set about improving our full filesystem sync to support merging parallel full syncs into single delta commits. That one change improved fsync performance so much that I almost abandoned my separate fsync work. However, a true, separate fsync with aggressive group commit eventually proved superior, so now we have both: a high performance fsync and a full filesystem sync that is nearly as fast under many loads.

/*
* syncs.c
* * D.R. Phillips, 2015
* * To build: c99 -Wall syncs.c -o syncs
* To run: ./syncs [<filename> [<syncs> [<tasks>]]]
*/

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <sys/stat.h>

char text[1024] = { "hello world!\n" };

int main(int argc, const char *argv[]) {
const char *basename = argc < 1 ? "foo" : argv[1];
char name[100];
int steps = argc < 3 ? 1 : atoi(argv[2]);
int tasks = argc < 4 ? 1 : atoi(argv[3]);
int err, fd;

for (int t = 0; t < tasks; t++) {
snprintf(name, sizeof name, "%s%i", basename, t);
if (!fork())
goto child;
}
for (int t = 0; t < tasks; t++)
wait(&err);
return 0;

child:
fd = creat(name, S_IRWXU);
for (int i = 0; i < steps; i++) {
write(fd, text, sizeof text);
fsync(fd);
}
return 0;
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/