Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

From: Dave Chinner
Date: Mon May 02 2016 - 20:46:06 EST


On Mon, May 02, 2016 at 10:53:25AM -0700, Dan Williams wrote:
> On Mon, May 2, 2016 at 8:18 AM, Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
> > Dave Chinner <david@xxxxxxxxxxxxx> writes:
> [..]
> >> We need some form of redundancy and correction in the PMEM stack to
> >> prevent single sector errors from taking down services until an
> >> administrator can correct the problem. I'm trying to understand
> >> where this is supposed to fit into the picture - at this point I
> >> really don't think userspace applications are going to be able to do
> >> this reliably....
> >
> > Not all storage is configured into a RAID volume, and in some instances,
> > the application is better positioned to recover the data (gluster/ceph,
> > for example). It really comes down to whether applications or libraries
> > will want to implement redundancy themselves in order to get a bump in
> > performance by not going through the kernel. And I think I know what
> > your opinion is on that front. :-)
> >
> > Speaking of which, did you see the numbers Dan shared at LSF on how much
> > overhead there is in calling into the kernel for syncing? Dan, can/did
> > you publish that spreadsheet somewhere?
>
> Here it is:
>
> https://docs.google.com/spreadsheets/d/1pwr9psy6vtB9DOsc2bUdXevJRz5Guf6laZ4DaZlkhoo/edit?usp=sharing
>
> On the "Filtered" tab I have some of the comparisons where:

Those numbers are really wacky - the inconsistent decimal place
representation makes it really, really hard to read the differences
in orders of magnitude, too. Let's take the first numbers - noop, 64
byte ops are:

threads ops/s
1 90M
2 310M
4 65M
8 175M
16 426M

Why aren't these linear? And if the test is not running in an
environment where these are controlled and linear, how valid are the
rest of the tests and hence the comparison.

> noop => don't call msync and don't flush caches in userspace
>
> persist => cache flushing only in userspace and only on individual cache lines

So these look a lot more linear than the no-op behaviour, so I'll
just ignore the no-op results for now.

> persist_4k => cache flushing only in userspace, but flushing is
> performed in 4K aligned units

Urg, your "vs persist" percentages are all wrong. You can't have a
"-1000%" difference, you have "persist 4k" running at 10% of the
speed of "persist".

So, with that in mind, the "persist_4k" speed is:

ops/s single thread
Size vs "persist" 4k flush rate
64 10% 834k
128 13% 849k
256 15% 410k(one off variation?)
512 20% 860k
1024 25% 850k
2048 50% 840k
4096 none 836k
8192 none 410k

What we see here is that the CPU(s) can flush the 4k pages at a rate
of roughly 850,000 flushes/s, whilst the 64 byte flush rate is
around 8.8M flushes/s. This is clearly demonstrated in the numbers
- as the dirty object size approaches the cache flush granularity,
the speed approaches single cacheline flush granularity speed.

Comparing 4k vs 64b flushes, we have 63 clean cache line flushes
taking roughly the same time as 9 dirty cache line flushes. Nice
numbers - that means a clean cache line flush has ~14% of the
overhead of dirty cache line flush. Seems rather high - it's tens of
CPU cycles to determine that the flush is a no-op for that
cacheline.

Fixing this seems like a hardware optimisation issue to me, but I
still have to question how many applications are going to have such
fine-grained random synchronous memory writes that this actually
matters in practice? If we are doing such small writes across
multiple different 4k pages, then TLB overhead for all the page
faults is going to be as much of an issue as 4k cache flushes...

> msync => same granularity flushing as the 'persist' case, but the
> kernel internally promotes this to a 4K sized / aligned flush

So you're calling msync for every modification that is made? What
application needs to do that? Anyway, page flush rates paint an
interesting picture:

single thread versus
Size 4k flush rate persist_4k
64 655k 78%
128 655k 81%
256 670k 163% (* persist 4k number low)
512 681k 79%
1024 666k 78%
2048 650k 77%
4096 652k 78%
8192 390k 95%

msync adds relatively little overhead (~20% extra overhead) compared
to the performance loss from the 4k flush granularity change. And
given this appears to be a worst case test scenario (and I'm sure
msync could be improved), I don't think this demonstrates a problem
with using msync.

IMO, these numbers don't support the argument that the *msync
model* for data integrity for DAX is flawed, unworkable, or too
slow. What I see is a performance problem resulting from the
overhead of flushing clean cachelines. i.e. there's data here that
supports the argument for reducing the overhead of flushing clean
cachelines in the hardware and/or better tracking of dirty
cachelines within the kernel, but not data that says the msync()
based data integrity model is the source of the problem.

i.e. separate the programming model from the performance issue, and
we can see that the performance problem is not caused by the
programming model - it's caused by the kernel implementation of the
model.

> The takeaway is that msync() is 9-10x slower than userspace cache management.

An alternative viewpoint: that flushing clean cachelines is
extremely expensive on Intel CPUs. ;)

i.e. Same numbers, different analysis from a different PoV, and
that gives a *completely different conclusion*.

Think about it for the moment. The hardware inefficiency being
demonstrated could be fixed/optimised in the next hardware product
cycle(s) and so will eventually go away. OTOH, we'll be stuck with
whatever programming model we come up with for the next 30-40 years,
and we'll never be able to fix flaws in it because applications will
be depending on them. Do we really want to be stuck with a pmem
model that is designed around the flaws and deficiencies of ~1st
generation hardware?

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx