Re: Tux3 Report: How fast can we fsync?

From: Daniel Phillips
Date: Tue May 12 2015 - 13:41:16 EST


Tux3 Report: How fast can we fail?

Tux3 now has a preliminary out of space handling algorithm. This might
sound like a small thing, but in fact handling out of space reliably
and efficiently is really hard, especially for Tux3. We developed an
original solution with unusually low overhead in the common case, and
simple enough to prove correct. Reliability seems good so far. But not
to keep anyone in suspense: Tux3 does not fail very fast, but it fails
very reliably. We like to think that Tux3 is better at succeeding than
failing.

We identified the following quality metrics for this algorithm:

1) Never fails to detect out of space in the front end.
2) Always fills a volume to 100% before reporting out of space.
3) Allows rm, rmdir and truncate even when a volume is full.
4) Writing to a nearly full volume is not excessively slow.
5) Overhead is insignificant when a volume is far from full.

Like every filesystem that does delayed allocation, Tux3 must guess how
much media space will be needed to commit any update it accepts into
cache. It must not guess low or the commit may fail and lose data. This
is especially tricky for Tux3 because it does not track individual
updates, but instead, partitions updates atomically into delta groups
and commits each delta as an atomic unit. A single delta can be as
large as writable cache, including thousands of individual updates.
This delta scheme ensures perfect namespace, metadata and data
consistency without complex tracking of relationships between thousands
of cache objects, and also does delayed allocation about as well as it
can be done. Given these benefits, it is not too hard to accept some
extra pain in out of space accounting.

Speaking of accounting, we borrow some of that terminology to talk
about the problem. Each delta has a "budget" and computes a "balance"
that declines each time a transaction "cost" is "charged" against it.
The budget is all of free space, plus some space belonging to the
current disk image that we know will be released soon, minus a reserve
for taking care of certain backend duties. When the balance
goes negative, the transaction backs out its cost, triggers a delta
transition, and tries again. This has the effect of shrinking the delta
size as a volume approaches full. When the delta budget finally shrinks
to less than the transaction cost, the update fails with ENOSPC.
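
To make the mechanics concrete, here is a rough sketch of that charge
path. All of the names (delta_budget, delta_balance, charge_cost,
force_delta_transition) are invented for the example and are not the
actual Tux3 symbols:

/* Sketch only: illustrative counters, not the real Tux3 code */
#include <stdatomic.h>
#include <errno.h>

static atomic_long delta_budget;	/* set by the backend at delta transition */
static atomic_long delta_balance;	/* declines as costs are charged */

void force_delta_transition(void);	/* provided by the backend, not shown */

/* Charge a worst case cost against the front delta, or fail with ENOSPC */
int charge_cost(long cost)
{
	for (;;) {
		if (atomic_fetch_sub(&delta_balance, cost) - cost >= 0)
			return 0;	/* charge accepted */
		/* Balance went negative: back out the cost */
		atomic_fetch_add(&delta_balance, cost);
		/* If even a brand new delta could not cover it, give up */
		if (atomic_load(&delta_budget) < cost)
			return -ENOSPC;
		/* Otherwise force a transition, shrinking the delta, and retry */
		force_delta_transition();
	}
}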

This is where the "how fast can we fail" question comes up. If our guess
at cost is way higher than actual blocks consumed, deltas take a long
time to shrink. Overestimating transaction cost by a factor of ten
can trigger over a hundred deltas before failing. Fortunately, deltas
are pretty fast, so we only keep the user waiting for a second or so
before delivering the bad news. We also slow down measurably, but not
horribly, when getting close to full. Ext4 by contrast flies along at
full speed right until it fills the volume, and stops on a dime at
exactly 100% full. I don't think that Tux3 will ever be as good at
failing as that, but we will try to get close.

Before I get into how Tux3's out of space behavior stacks up against
other filesystems, there are some interesting details to touch on about
how we go about things.

Tux3's front/back arrangement is lockless, which is great for
performance but turns into a problem when front and back need to
cooperate about something like free space accounting. If we were willing
to add a spinlock between front and back this would be easy, but we don't
want to do that. Not only are we jealously protective of our lockless
design, but if our fast path suddenly became slower because of adding
essential functionality we might need to revise some posted benchmark
results. Better that we should do it right and get our accounting
almost for free.

The world of lockless algorithms is an arcane one indeed, just ask Paul
McKenney about that. The solution we came up with needs just two atomic
adds per transaction, and we will eventually turn one of those into a
per-cpu counter. As mentioned above, a frontend transaction backs out
its cost when the delta balance goes negative, so from the backend's
point of view, the balance is going up and down unpredictably all the
time. Delta transition can happen at any time, and somehow, the backend
must assign the new front delta its budget exactly at transition.
Meanwhile, the front delta balance is still going up and down
unpredictably. See the problem? The issue is, delta transition is truly
asynchronous. We can't change that short of adding locks with the
contention and stalls that go along with them.

Fortunately, one consequence of delta transition is that the total cost
charged to the delta instantly becomes stable when the front delta
becomes the back delta. Volume free space is also stable because only
the backend accesses it. The backend can easily measure the actual
space consumed by the back delta: it is the difference between free
space before and after flushing to media. Updating the front delta
budget is easy because only the backend changes it, but updating the
front delta balance is much harder because the front delta is busy
changing it. If we get this wrong, the resulting slight discrepancies
between budget, balance and charged costs would mean that somebody,
somewhere will hit out of space in the middle of a commit and end up
sticking pins into a voodoo doll that looks like us.

A solution was found that only took a few lines of code and some pencil
pushing. The backend knows what the front delta balance must have been
exactly at transition, because it knows the amount charged to the back
delta, and it knows the original budget. It can therefore deduce how
much the front balance should have increased exactly at transition (it
must always increase) so it adds that difference atomically to the
front delta budget. This has exactly the same effect as setting the
balance atomically at transition time, if that were possible, which it
is not. This technique is airtight, and the whole algorithm ends up
costing less than a hundred nanoseconds per transaction.[1] This is a
good thing because each page of a Tux3 write is a separate transaction,
so any significant overhead would stick out like a sore thumb.
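
In code, the backend's side of that trick might look something like the
sketch below, which reuses the made-up counters from the earlier sketch
and folds the budget and balance bookkeeping into the simplest form that
shows the idea; back_budget and back_charged stand for values the
backend already knows and are not the real Tux3 fields:

/* Sketch only: runs in the backend, once per delta transition */
void backend_delta_transition(long back_budget, long back_charged,
			      long free_blocks, long reserve)
{
	/* What the front balance must have been exactly at transition */
	long old_balance = back_budget - back_charged;

	/* Budget for the new front delta, from free space and reserve,
	 * both stable because only the backend touches them */
	long new_budget = free_blocks - reserve;

	/* The adjustment is always an increase. Applying it with a single
	 * atomic add has the same effect as setting the balance exactly at
	 * transition, even while the frontend keeps charging against it. */
	atomic_fetch_add(&delta_balance, new_budget - old_balance);
	atomic_store(&delta_budget, new_budget);
}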

Accounting cost estimates properly and stopping when actually out of
space is just the core of the algorithm. We must feed that core with
good, provable worst case cost estimates. To get an initial idea of
whether the algorithm works, we just plugged in some magic numbers, and
lo and behold, suddenly we were not running out of space in the
backend any more. But to do the job properly we need to consider things
like the file index btree depth, because just plugging in a number large
enough to handle the deepest possible btree would slow down our failure
path way too much.

The best way to account for btree depth is to make it disappear entirely
by removing the btree updates from the delta commit path. We already do
that for bitmaps, which is a good thing because our bitmaps are just
blocks that live in a normal file. Multiplying our worst case by the
maximum number of bitmaps that could possibly be affected, and then
multiplying that by the worst case change to the bitmap metadata,
including its data btree, its inode, and the inode table btree, would be
a real horror. Instead, we log all changes that affect the bitmap and
only update the bitmaps periodically at unify cycles. A Tux3 filesystem
is consistent whether or not we unify, so if space becomes critically
tight the backend can just disable the unify. The only bad effect is
that the log chain can grow and make replay take longer, but that growth
is limited by the fact that there is not much space left for more log
blocks.
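
As a sketch of that policy, with invented names and thresholds rather
than the real Tux3 logic, the backend's decision amounts to something
like this:

#include <stdbool.h>

enum { unify_reserve = 100, log_chain_limit = 64 };	/* assumed numbers */

/* Unify periodically to keep the log chain (and replay time) short, but
 * never when free space is critically tight, so that bitmap flushing
 * does not have to be charged against the delta budget. Sketch only. */
bool should_unify(long free_blocks, long log_chain_blocks)
{
	if (free_blocks < unify_reserve)
		return false;
	return log_chain_blocks >= log_chain_limit;
}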

If we did not have this nice way of making bitmap overhead disappear,
we would not be anywhere close to a respectable implementation today.
Actually, we weren't even thinking about out of space accounting when
we developed this design element; we were simply trying to get rid of
the overhead of updating bitmaps per delta. That worked well, and it is
a significant part of the reason why we can outrun Ext4 while having a
very similar structure. The benefit for space accounting dropped out
just by dumb luck.

The same technique we use for hiding bitmap update cost works just as
well for btree metadata. Eventually, we will move btree leaf redirecting
from the delta flush to the unify flush. That will speed it up by
coalescing some index block writes and also make it vanish from the
transaction cost estimate, saving frontend CPU and speeding up the
failure path. What's not to like? It is on the list of things to do.

Today, I refactored the budgeting algorithm to skip the cost estimate
if a page is already dirty, which tightened up the estimate by a factor
of four or so and made things run smoother. There will be more
incremental improvements as time goes by. For example, we currently
overestimate the cost of a rewrite because we would need to go poking
around in btrees to do that more accurately. Fixing that will be quite
a bit of work, but somebody will probably do it, sometime.
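
The check itself is simple; here is a sketch of the idea, using a
hypothetical per-page field rather than the real page state, and an
assumed worst case cost:

/* Sketch only: dirty_delta is a hypothetical field recording which
 * delta a page was last dirtied in */
enum { worst_page_cost = 4 };	/* assumed worst case blocks per fresh page */

struct page_info { unsigned dirty_delta; };

long write_charge(struct page_info *page, unsigned front_delta)
{
	if (page->dirty_delta == front_delta)
		return 0;		/* already dirty, already budgeted */
	page->dirty_delta = front_delta;
	return worst_page_cost;		/* charge the full worst case once */
}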

Now the fun part: performance and bugs. Being anxious to know where
Tux3 stands with respect to the usual suspects, I ran some tests and
found that Ext4 is amazingly good at this, while XFS and Btrfs have
some serious issues. Details below.

Tux3 slows down when approaching a full state, hopefully not too much.
To quantify that, here is what happens with a 200 MB dd to a loopback
mounted file on tmpfs:

                  Volume size   Run time
No check at all:      1500 MB     0.306s
Far from full:        1500 MB     0.318s
Getting full:           30 MB     0.386s
Just over full:         20 MB     0.624s

The slowdown used to be a lot more before I improved the cost estimate
for write earlier today. Here is how we compare to other filesystems:

          Far from full   Just over full

tux3:            0.303s           0.468s
ext4:            0.399s           0.400s
xfs:             0.293s           0.326s
btrfs:           0.499s           0.531s
(20 MB dd to ramdisk)

XFS ekes out a narrow win on straight up dd to the ramdisk, good job.
The gap widens when hitting the failure path, but again, not as much as
it did earlier today.

I do most of these no space tests on a ramdisk (actually, a loopback
mount on tmpfs) because it is easy to fill up. To show that the ramdisk
results are not wildly different from a real disk, here we see that the
pattern is largely unchanged:

20 MB dd to a real disk

tux3: 1.568s
ext4: 1.523s
xfs: 1.466s
btrfs: 2.347s

XFS holds its dd lead on a real hard disk. We definitely need to learn
its trick.

Next we look at something with a bit more meat: unzipping the Git
source to multiple directories. Timings on ramdisk are the interesting
ones, because the volume approaches full on the longer test.

           10x to ram   40x to ram   10x to hdd   100x to hdd
tux3:          2.251s       8.344s       2.671s       21.686s
ext4:          2.343s       7.923s       3.165s       32.080s
xfs:           2.682s      10.562s      11.870s      123.435s
btrfs:         3.744s      15.825s       3.749s       72.405s

Tux3 is the fastest when not close to full, but Ext4 takes a slight
lead when close to full. Yesterday, that lead was much wider, and one
day we would be pleased to tie Ext4, which is really exemplary at this.
The hard disk times are there because they happened to be easy to get,
and it is interesting to see how much XFS and Btrfs are suffering on
traditional rust, XFS being the worst by far at 5.7 times slower than
Tux3.

The next one is a crash test: repeatedly untar a tarball until it
smashes into the wall, and see how long it takes to quit with an error.
Tar is nice for this because its failure handling is so awful: instead
of exiting on the first ENOSPC, it keeps banging at the full disk
until it has failed on each and every file in its archive. First I dd
the volume until just before full, then throw a tarball at it.

Time to fail when tar hits disk full:

tux3: 0.563s
ext4: 0.084s
xfs: 0.116s
btrfs: 0.460s

We respectfully concede that Ext4 is the king of fail and Tux3 is
the worst. However, we only need to be good enough on this one, with less
than a second being a very crude definition of good enough.

The next one is something I ran into when I was testing out of space
detection with rewrites. This uses the "blurt" program at the end of
this post to have 1000 tasks in parallel each write 40K, 1K at a time,
using the bash one-liner:

for ((j=1;j<10;j++)); do \
    for ((i=1;i<10;i++)); do \
        echo step $j:$i 1>&2 && mkdir -p fs/$i && \
        ~/blurt fs/$i/f 40 1000 || break 2; \
    done; \
done

Tux3: 4.136s (28% full when done)
Ext4: 5.780s (31% full when done)
XFS: 79.063s (!)
Btrfs: 15.489s (fails with out of space when 30% full)

Blurt is a minor revision of my fsync load generator without the fsync,
and with an error exit on disk full. The intent of the outer loop is
to do rewrites with a thousand tasks in parallel, and see if out of
space accounting is accurate. XFS and Btrfs both embarrassed themselves
horribly. XFS falls off a performance cliff that makes it 19 times
slower than Tux3, and Btrfs hits ENOSPC when only 30% full according to
df, or 47% full if you prefer to believe its own df command:

Data, single: total=728.00MiB, used=342.25MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=65.00MiB, used=5.97MiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

It seems that Btrfs still has not put its epic ENOSPC nightmare behind
it. I fervently hope that such a fate does not await Tux3, a hope that
appears to be well on its way to being borne out.

XFS should not do such bizarre things after 23 years of development,
while being billed as a mature, enterprise grade filesystem. It simply
is not there yet. Ext4 is exemplary in terms of reliability, and Tux3
has been really good through this round of torture tests, though
I will not claim that it is properly hardened just yet. I know it isn't.
We don't have any open bugs, but that is probably because we only have
two users. But Tux3 is remarkably solid for the number of man years
that have gone into it. Maybe Tux3 really will be ready for the
enterprise before XFS is.

In all of these tests, Tux3, Ext4 and XFS managed to fill up their
volumes to exactly 100%. Tux3 actually has a 100 block emergency
reserve that it never fills, and wastes a few more blocks if the last
transaction does not exactly use up its budget, but apparently that
still falls within the df utility's definition of 100%. Btrfs never gets
this right: full for it tends to range from 96% to 98%, and sometimes is
much lower, like 28%. It has its own definition of disk full in its own
utility, but that does not seem to be very accurate either. This part of
Btrfs needs major work. Even at this early stage, Tux3 is much better
than that.

One thing we can all rejoice over: nobody ever hit out of space while
trying to commit. At least, nobody ever admitted it. And nobody oopsed,
or asserted, though XFS did exhibit some denial of service issues where
the filesystem was unusable for tens of seconds.

Once again, in the full disclosure department: there are some known
holes remaining in Tux3's out of space handling. The unify suspend
algorithm is not yet implemented, without which we cannot guarantee
that out of space will never happen in commit. With the simple expedient
of a 100 block emergency reserve, it has never yet happened, but no
doubt some as yet untested load can make it happen. ENOSPC handling for
mmap is not yet implemented. Cost estimates for namespace operations
are too crude and ignore btree depth. Cost estimates could be tighter
than they are, to give better performance and report disk full more
promptly. The emergency reserve should be set each delta according to
delta budget. Big truncates need to be split over multiple commits
so they always free more blocks than they consume before commit. That
is about it. On the whole, I am really happy with the way this
has worked out.

Well, that is that for today. Tux3 now has decent out of space handling
that appears to work well and has a good strong theoretical basis. It
needs more work, but is no longer a reason to block Tux3 from merging,
if it ever really was.

Regards,

Daniel

[1] Overhead of an uncontended bus locked add is about 6 nanoseconds on
my i5, and about ten times higher when contended.
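
For the curious, a number like that can be had from a trivial
single-threaded loop, something like this sketch, which only measures
the uncontended case:

#define _POSIX_C_SOURCE 199309L
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	enum { loops = 100000000 };
	static atomic_long counter;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < loops; i++)
		atomic_fetch_add(&counter, 1);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + t1.tv_nsec - t0.tv_nsec;
	printf("%.1f ns per atomic add\n", ns / loops);
	return 0;
}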

/*
* Blurt v0.0
*
* A trivial multitasking filesystem load generator
*
* Daniel Phillips, June 2015
*
* to build: c99 -Wall blurt.c -oblurt
* to run: blurt <basename> <steps> <tasks>
*/

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

enum { chunk = 1024, sync = 0 };

char text[chunk] = { "hello world!\n" };

int main(int argc, const char *argv[]) {
	const char *basename = argc < 2 ? "foo" : argv[1];
	char name[100];
	int steps = argc < 3 ? 1 : atoi(argv[2]);
	int tasks = argc < 4 ? 1 : atoi(argv[3]);
	int fd, status, errors = 0;

	/* Fork one writer task per file */
	for (int t = 0; t < tasks; t++) {
		snprintf(name, sizeof name, "%s%i", basename, t);
		if (!fork())
			goto child;
	}
	/* Parent: reap the writers and report whether any of them failed */
	for (int t = 0; t < tasks; t++) {
		wait(&status);
		if (WIFEXITED(status) && WEXITSTATUS(status))
			errors++;
	}
	return !!errors;

child:
	fd = creat(name, S_IRWXU);
	if (fd == -1)
		goto fail1;
	for (int i = 0; i < steps; i++) {
		int ret = write(fd, text, sizeof text);
		if (ret == -1)
			goto fail2;
		if (sync)
			fsync(fd);
	}
	return 0;
fail1:
	perror("create failed");
	return 1;
fail2:
	perror("write failed");
	return 1;
}