Re: Tux3 Report: How fast can we fsync?

From: David Lang
Date: Fri May 01 2015 - 21:08:16 EST


On Fri, 1 May 2015, Daniel Phillips wrote:

On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:

Well, yes - I never claimed XFS is a general purpose filesystem. It
is a high performance filesystem. It is also becoming more relevant
to general purpose systems as low cost storage gains capabilities
that used to be considered the domain of high performance storage...

OK. Well, Tux3 is general purpose and that means we care about single
spinning disk and small systems.

keep in mind that if you optimize only for the small systems you may not scale as well to the larger ones.

So, to demonstrate, I'll run the same tests but using a 256GB
samsung 840 EVO SSD and show how much the picture changes.

I will go you one better, I ran a series of fsync tests using
tmpfs, and I now have a very clear picture of how the picture
changes. The executive summary is: Tux3 is still way faster, and
still scales way better to large numbers of tasks. I have every
confidence that the same is true of SSD.

/dev/ramX can't be compared to an SSD. Yes, they both have low
seek/IO latency but they have very different dispatch and IO
concurrency models. One is synchronous, the other is fully
asynchronous.

I had ram available and no SSD handy to abuse. I was interested in
measuring the filesystem overhead with the device factored out. I
mounted loopback on a tmpfs file, which seems to be about the same as
/dev/ram, maybe slightly faster, but much easier to configure. I ran
some tests on a ramdisk just now and was mortified to find that I have
to reboot to empty the disk. It would take a compelling reason before
I do that again.
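For reference, a setup along the lines described above can be reproduced with something like the following. This is my own sketch, not the poster's actual commands; the sizes, paths, and choice of mkfs are illustrative assumptions, and it has to run as root:

```shell
# Back a loop device with a file on tmpfs, then put a filesystem on it.
# All sizes and mount points are assumptions, not from the original test.
mount -t tmpfs -o size=3g tmpfs /mnt/ramback
truncate -s 2g /mnt/ramback/backing.img
LOOPDEV=$(losetup --find --show /mnt/ramback/backing.img)
mkfs.ext4 "$LOOPDEV"        # or mkfs.xfs, mkfs.btrfs, etc.
mkdir -p /mnt/test
mount "$LOOPDEV" /mnt/test
```

Unlike /dev/ram, tearing this down is just two umounts and a `losetup -d`, which matches the "much easier to configure" point above.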

This is an important distinction, as we'll see later on....

I regard it as predictive of Tux3 performance on NVM.

per the ramdisk, maybe, but possibly not as relevant as you may think. This is why it's good to test on as many different systems as you can. As you run into different types of performance behavior you can then pick configurations to keep and test all the time.

Single spinning disk is interesting now, but will be less interesting later. multiple spinning disks in an array of some sort is going to remain very interesting for quite a while.

now, some things take a lot more work to test than others. Getting time on a system with a high performance, high capacity RAID is hard, but getting hold of an SSD from Fry's is much easier. If it's a budget item, ping me directly and I can donate one for testing (the cost of a drive is within my unallocated budget and using that to improve Linux is worthwhile)

Running the same thing on tmpfs, Tux3 is significantly faster:

Ext4: 1.40s
XFS: 1.10s
Btrfs: 1.56s
Tux3: 1.07s

3% is not "significantly faster". It's within run to run variation!

You are right, XFS and Tux3 are within experimental error for single
syncs on the ram disk, while Ext4 and Btrfs are way slower:

Ext4: 1.59s
XFS: 1.11s
Btrfs: 1.70s
Tux3: 1.11s

A distinct performance gap appears between Tux3 and XFS as parallel
tasks increase.

It will be interesting to see if this continues to be true on more systems. I hope it does.
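For concreteness, the kind of parallel fsync load being measured can be approximated with a small script like this one. This is my own sketch of such a load generator, not the actual benchmark used for the numbers above; dd's conv=fsync issues an fsync(2) on the output file after each write:

```shell
#!/bin/sh
# Hypothetical parallel-fsync load generator: TASKS concurrent writers,
# each doing ROUNDS small write+fsync cycles, timed wall-clock.
DIR=${DIR:-/tmp/fsync-bench}
TASKS=${TASKS:-8}
ROUNDS=${ROUNDS:-25}
mkdir -p "$DIR"
start=$(date +%s)
i=1
while [ "$i" -le "$TASKS" ]; do
  (
    j=1
    while [ "$j" -le "$ROUNDS" ]; do
      # 4k write followed by fsync(2) via conv=fsync (GNU dd)
      dd if=/dev/zero of="$DIR/task$i" bs=4k count=1 conv=fsync 2>/dev/null
      j=$((j + 1))
    done
  ) &
  i=$((i + 1))
done
wait
end=$(date +%s)
echo "$TASKS tasks x $ROUNDS fsyncs: $((end - start))s"
```

Point DIR at a mount of each filesystem under test and scale TASKS up to see how the per-filesystem gap changes with concurrency.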

You wish. In fact, Tux3 is a lot faster. ...

Yes, it's easy to be fast when you have simple, naive algorithms and
an empty filesystem.

No it isn't or the others would be fast too. In any case our algorithms
are far from naive, except for allocation. You can rest assured that
when allocation is brought up to a respectable standard in the fullness
of time, it will be competitive and will not harm our clean filesystem
performance at all.

There is no call for you to disparage our current achievements, which
are significant. I do not mind some healthy skepticism about the
allocation work, you know as well as anyone how hard it is. However your
denial of our current result is irritating and creates the impression
that you have an agenda. If you want to complain about something real,
complain that our current code drop is not done yet. I will humbly
apologize, and the same for enospc.

As I'm reading Dave's comments, he isn't attacking you the way you seem to think he is. He is pointing out that there are problems with your data, but he's also taking a lot of time to explain what's happening (and yes, some of this is probably because your simple tests with XFS made it look so bad)

the other filesystems don't use naive algorithms, they use something more complex, and while your current numbers are interesting, they are only preliminary until you add something to handle fragmentation. That can cause very significant problems. Remember how fabulous btrfs looked in the initial reports? and then corner cases were found that caused real problems, and as the algorithms have been changed to prevent those corner cases from being so easy to hit, the common case has suffered somewhat. This isn't an attack on Tux3 or btrfs, it's just a reality of programming. If you are not accounting for all the corner cases, everything is easier, and faster.

That's roughly 10x faster than your numbers. Can you describe your
test setup in detail? e.g. post the full log from block device
creation to benchmark completion so I can reproduce what you are
doing exactly?

Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
more substantial, so I can't compare my numbers directly to yours.

If you are doing tests with a 4G ramdisk on a machine with only 4G of RAM, it seems like you end up testing a lot more than just the filesystem. Testing in such low memory situations can identify significant issues, but it is questionable as a 'which filesystem is better' benchmark.

Clearly the curve is the same: your numbers increase 10x going from 100
to 1,000 tasks and 12x going from 1,000 to 10,000. The Tux3 curve is
significantly flatter and starts from a lower base, so it ends with a
really wide gap. You will need to take my word for that for now. I
promise that the beer is on me should you not find that reproducible.

The repository delay is just about not bothering Hirofumi for a merge
while he finishes up his inode table anti-fragmentation work.

Just a suggestion, but before you do a huge post about how great your filesystem is performing, making the code available so that others can test it when prompted by your post is probably a very good idea. If it means that you have to send out your post a week later, it's a very small cost for the benefit of having other people able to easily try it on hardware that you don't have access to.

If there is a reason to post without the code being in the main, publicised repo, then your post should point people at what code they can use to duplicate it.

but really, 11 months without updating the main repo?? This is Open Source development, publish early and often.

Note: you should recheck your final number for Btrfs. I have seen
Btrfs fall off the rails and take wildly longer on some tests just
like that.

Completely reproducible...

I believe you. I found that Btrfs does that way too much. So does XFS
from time to time, when it gets up into lots of tasks. Read starvation
on XFS is much worse than Btrfs, and XFS also exhibits some very
undesirable behavior with initial file create. Note: Ext4 and Tux3 have
roughly zero read starvation in any of these tests, which pretty much
proves it is not just a block scheduler thing. I don't think this is
something you should dismiss.

something to investigate, but I have seen problems on ext* in the past. ext4 may have fixed this, or it may just have moved the point where it triggers.

I wouldn't be so sure about that...

Tasks:        8            16            32
Ext4:     93.06 MB/s   98.67 MB/s   102.16 MB/s
XFS:      81.10 MB/s   79.66 MB/s    73.27 MB/s
Btrfs:    43.77 MB/s   64.81 MB/s    90.35 MB/s ...

Ext4:    807.21 MB/s  1089.89 MB/s   867.55 MB/s
XFS:     997.77 MB/s  1011.51 MB/s   876.49 MB/s
Btrfs:    55.66 MB/s    56.77 MB/s    60.30 MB/s

Numbers are again very different for XFS and ext4 on /dev/ramX on my
system. Need to work out why yours are so low....

Your machine makes mine look like a PCjr.

The interesting thing here is that on the faster machine btrfs didn't speed up significantly while ext4 and xfs did. It will be interesting to see what the results are for tux3.

and both of you need to remember that while servers are getting faster, we are also seeing much lower power, weaker servers showing up as well. And while these smaller servers are not trying to do the 10000 thread fsync workload, they are using flash based storage more frequently than they are spinning rust (frequently through the bottleneck of an SD card) so continuing tests on low end devices is good.

I said then that when we
got around to a proper fsync it would be competitive. Now here it
is, so you want to change the topic. I understand.

I haven't changed the topic, just the storage medium. The simple
fact is that the world is moving away from slow sata storage at a
pretty rapid pace and it's mostly going solid state. Spinning disks
are also changing - they are going to ZBC based SMR, which is a
completely different problem space which doesn't even appear to be
on the tux3 radar....

So where does tux3 fit into a storage future of byte addressable
persistent memory and ZBC based SMR devices?

You won't convince us to abandon spinning rust, it's going to be around
a lot longer than you think. Obviously, we care about SSD and I believe
you will find that Tux3 is more than competitive there. We lay things
out in a very erase block friendly way. We need to address the volume
wrap issue of course, and that is in progress. This is much easier than
spinning disk.

Tux3's redirect-on-write[1] is obviously a natural for SMR, however
I will not get excited about it unless a vendor waves money.

what drives are available now? see if you can get a couple (either directly or donated)

David Lang