Re: Tux3 Report: How fast can we fsync?
From: Daniel Phillips
Date: Sat May 02 2015 - 06:26:24 EST
On Friday, May 1, 2015 6:07:48 PM PDT, David Lang wrote:
> On Fri, 1 May 2015, Daniel Phillips wrote:
>> On Friday, May 1, 2015 8:38:55 AM PDT, Dave Chinner wrote:
>>> Well, yes - I never claimed XFS is a general purpose filesystem. It
>>> is a high performance filesystem. It is also becoming more relevant
>>> to general purpose systems as low cost storage gains capabilities
>>> that used to be considered the domain of high performance storage...
>>
>> OK. Well, Tux3 is general purpose and that means we care about single
>> spinning disk and small systems.
>
> Keep in mind that if you optimize only for the small systems you may
> not scale as well to the larger ones.

Tux3 is designed to scale, and it will when the time comes. I look
forward to putting Shardmap through its billion file test in due course.
However, right now it would be wise to stay focused on basic
functionality suited to a workstation because volunteer devs tend to
have those. After that, phones are a natural direction, where hard core
ACID commit and really smooth file ops are particularly attractive.

> Per the ramdisk, but possibly not as relevant as you may think.
> This is why it's good to test on as many different systems as
> you can. As you run into different types of performance you can
> then pick ones to keep and test all the time.

I keep being surprised how well it works for things we never tested
before.

> Single spinning disk is interesting now, but will be less
> interesting later. Multiple spinning disks in an array of some
> sort are going to remain very interesting for quite a while.

The way to do md well is to integrate it into the block layer the way
FreeBSD does with GEOM, and expose a richer interface to the filesystem.
That is how I think Tux3 should work with big iron RAID. I hope to be
able to tackle that sometime before the stars start winking out.
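
To give "richer interface" some shape: imagine the raid layer exporting
its geometry up to the filesystem, something like the sketch below. This
is purely hypothetical, my illustration and not any existing kernel API:

/* Hypothetical sketch, not an existing kernel interface: an md
 * integrated into the block layer could export array geometry to
 * the filesystem, the way GEOM lets FreeBSD layers share such
 * knowledge. */
struct raid_geometry {
        unsigned int stripe_unit;       /* bytes per member before moving on */
        unsigned int data_members;      /* data-bearing members per stripe */
        unsigned int redundancy;        /* member failures survivable */
};

/* An allocator that knows the geometry can round extents up to
 * full stripe boundaries and avoid raid5/6 read-modify-write. */
static unsigned long long stripe_align(const struct raid_geometry *g,
                                       unsigned long long pos)
{
        unsigned long long stripe = (unsigned long long)g->stripe_unit
                                    * g->data_members;
        return (pos + stripe - 1) / stripe * stripe;
}
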
> Now, some things take a lot more work to test than others.
> Getting time on a system with a high performance, high capacity
> RAID is hard, but getting hold of an SSD from Fry's is much
> easier. If it's a budget item, ping me directly and I can donate
> one for testing (the cost of a drive is within my unallocated
> budget and using that to improve Linux is worthwhile).

Thanks.

> As I'm reading Dave's comments, he isn't attacking you the way
> you seem to think he is. He is pointing out that there are
> problems with your data, but he's also taking a lot of time to
> explain what's happening (and yes, some of this is probably
> because your simple tests with XFS made it look so bad).

I hope the lightening-up trend continues.

> The other filesystems don't use naive algorithms, they use
> something more complex, and while your current numbers are
> interesting, they are only preliminary until you add something
> to handle fragmentation. That can cause very significant
> problems.

Fsync is pretty much agnostic to fragmentation, so those results are
unlikely to change substantially even if we happen to do a lousy job on
allocation policy, which I naturally consider unlikely. In fact, Tux3
fsync is going to get faster over time for a couple of reasons: the
minimum number of blocks per commit will be reduced, and we will get rid
of most of the seeks to the beginning of the volume that we currently
suffer per commit.
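
Anyone who wants to poke at the fragmentation question can time fsync
directly with a few lines of C. A minimal sketch of my own for
illustration, not the benchmark used for the report; the point is that
each pass through the loop pays for a commit, not for walking whatever
fragmentation the file already has:

/* fsynctest.c: time repeated small write+fsync commits.
 * Build: gcc -O2 -o fsynctest fsynctest.c
 * Run:   ./fsynctest /mnt/test/somefile 1000 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        const char *path = argc > 1 ? argv[1] : "fsynctest.dat";
        int i, loops = argc > 2 ? atoi(argv[2]) : 1000;
        char block[4096] = { 0 };
        struct timespec t0, t1;
        double secs;

        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < loops; i++) {
                if (write(fd, block, sizeof block) != sizeof block) {
                        perror("write");
                        return 1;
                }
                if (fsync(fd) < 0) {    /* one commit per loop */
                        perror("fsync");
                        return 1;
                }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d fsyncs in %.3f seconds (%.1f/second)\n",
               loops, secs, loops / secs);
        close(fd);
        return 0;
}
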
> Remember how fabulous btrfs looked in the initial
> reports? And then corner cases were found that caused real
> problems, and as the algorithms have been changed to prevent
> those corner cases from being so easy to hit, the common case
> has suffered somewhat. This isn't an attack on Tux3 or btrfs,
> it's just a reality of programming. If you are not accounting
> for all the corner cases, everything is easier, and faster.

>> Mine is a lame i5 minitower with 4GB from Fry's. Yours is clearly way
>> more substantial, so I can't compare my numbers directly to yours.
>
> If you are doing tests with a 4G ramdisk on a machine with only
> 4G of RAM, it seems like you end up testing a lot more than just
> the filesystem. Testing in such low memory situations can
> identify significant issues, but it is questionable as a 'which
> filesystem is better' benchmark.

A 1.3 GB tmpfs, and sorry, it is 10 GB (the machine next to it is 4G).
I am careful to ensure the test environment does not have spurious
memory or cpu hogs. I will not claim that this is the most sterile test
environment possible, but it is adequate for the task at hand. Nearly
always, when I find big variations in the test numbers it turns out to
be a quirk of one filesystem that is not exhibited by the others.
Everything gets multiple runs and lands in a spreadsheet. Any fishy
variance is investigated.
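
To make "fishy" concrete: what gets flagged is a run-to-run spread that
is out of proportion to the mean. A sketch of that kind of check, with
an illustrative threshold of my own choosing, not a number from the
report:

/* cv.c: flag fishy run-to-run variance via coefficient of variation.
 * Build: gcc -O2 -o cv cv.c -lm */
#include <math.h>
#include <stdio.h>

static double cv(const double *runs, int n)
{
        double sum = 0, var = 0, mean;
        int i;

        for (i = 0; i < n; i++)
                sum += runs[i];
        mean = sum / n;
        for (i = 0; i < n; i++)
                var += (runs[i] - mean) * (runs[i] - mean);
        return sqrt(var / n) / mean;    /* stddev relative to mean */
}

int main(void)
{
        double runs[] = { 10.1, 9.9, 10.0, 13.7 }; /* made-up example timings */
        int n = sizeof runs / sizeof runs[0];
        double c = cv(runs, n);

        printf("cv = %.1f%%: %s\n", c * 100,
               c > 0.05 ? "investigate" : "looks sane"); /* 5% is my pick */
        return 0;
}
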
By the way, the low variance kings by far are Ext4 and Tux3, and of
those two, guess which one is more consistent. XFS is usually steady,
but can get "emotional" with lots of tasks, and Btrfs has regular wild
mood swings whenever the stars change alignment. And while I'm making
gross generalizations: XFS and Btrfs go OOM way before Ext4 and Tux3.

> Just a suggestion, but before you do a huge post about how
> great your filesystem is performing, making the code available
> so that others can test it when prompted by your post is
> probably a very good idea. If it means that you have to send out
> your post a week later, it's a very small cost for the benefit
> of having other people able to easily try it on hardware that
> you don't have access to.

Next time. This time I wanted it off my plate as soon as possible so I
could move on to enospc work. And this way is more involving; we get a
little suspense before the rematch.

> If there is a reason to post without the code being in the
> main, publicized repo, then your post should point people at
> what code they can use to duplicate it.

I could have included the patch in the post; it is small enough. If it
still isn't in the repo in a few days then I will post it, to avoid
giving the impression that I'm desperately trying to fix obscure bugs in
it, which isn't the case.

> But really, 11 months without updating the main repo?? This is
> Open Source development, publish early and often.

It's not as bad as that:
https://github.com/OGAWAHirofumi/linux-tux3/commits/hirofumi
https://github.com/OGAWAHirofumi/linux-tux3/commits/hirofumi-user

> Something to investigate, but I have seen problems on ext* in
> the past. ext4 may have fixed this, or it may just have moved
> the point where it triggers.

My spectrum of tests is small and I am not hunting for anomalies, only
reporting what happened to come up. It is not very surprising that some
odd things happen with 10,000 tasks, there is probably not much test
coverage there. On the whole I was surprised and impressed when all
filesystems mostly just worked. I was expecting to hit scheduler issues
for one thing, and nothing obvious came up. Also, there was not one oops
on any filesystem (even Tux3) and only one assert, which had already been
reported upstream and turned out to be fixed a week or two ago.

...
>> Your machine makes mine look like a PCjr. ...
>
> The interesting thing here is that on the faster machine btrfs
> didn't speed up significantly while ext4 and xfs did. It will be
> interesting to see what the results are for tux3.

The numbers are well into the something-is-really-wrong zone (and I
should have flagged that earlier, but it was a long day). That test is
supposed to be -s, all synchronous, and his numbers are more typical of
async. Needs double checking all round, including here. Anybody can
replicate that test; it is only an apt-get install dbench away (hint
hint).
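
For reference, the synchronous run should look something like this (-t
runtime and -D target directory, as I read the dbench manpage; adjust
the mount point and the client count of 100 to taste):

apt-get install dbench
dbench -s -t 60 -D /mnt/test 100
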
Differences: my numbers are from KVM with a loopback mount on tmpfs. His
are on a ramdisk and probably native. I have to reboot to make a ramdisk
big enough to run dbench, and I would rather not right now.
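
For anyone who wants to reproduce the loopback-on-tmpfs side, the setup
is roughly the following; sizes and paths here are only examples:

mount -t tmpfs -o size=1536m tmpfs /mnt/ram
truncate -s 1400m /mnt/ram/vol.img
mkfs.ext4 /mnt/ram/vol.img              # or the mkfs of your choice
mount -o loop /mnt/ram/vol.img /mnt/test
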
How important is it to get to the bottom of the variance in test
results running on RAM? Probably important in the long run, because
storage devices are looking more like RAM all the time, but as of
today, maybe not very urgent.
Also, I was half expecting somebody to question the wisdom of running
benchmarks under KVM instead of native, but nobody did. Just for the
record, I would respond: running virtual probably accounts for the
majority of server instances today.

> And both of you need to remember that while servers are getting
> faster, we are also seeing much lower power, weaker servers
> showing up as well. And while these smaller servers are not
> trying to do the 10,000 thread fsync workload, they are using
> flash based storage more frequently than they are spinning rust
> (frequently through the bottleneck of an SD card), so continuing
> tests on low end devices is good.

Low end servers and embedded concern me more, indeed.

> What drives are available now? See if you can get a couple
> (either directly or donated).

Right, time to hammer on flash.
Regards,
Daniel