Re: [dm-devel] [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag

From: Pavel Begunkov
Date: Sun Oct 31 2021 - 09:23:04 EST


On 10/29/21 23:32, Dave Chinner wrote:
On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote:
On 10/28/21 23:59, Dave Chinner wrote:
[...]
Well, my point is doing recovery from bit errors is by definition not
the fast path. Which is why I'd rather keep it away from the pmem
read/write fast path, which also happens to be the (much more important)
non-pmem read/write path.

The trouble is, we really /do/ want to be able to (re)write the failed
area, and we probably want to try to read whatever we can. Those are
reads and writes, not {pre,f}allocation activities. This is where Dave
and I arrived at a month ago.

Unless you'd be ok with a second IO path for recovery where we're
allowed to be slow? That would probably have the same user interface
flag, just a different path into the pmem driver.

I just don't see how 4 single line branches to propage RWF_RECOVERY
down to the hardware is in any way an imposition on the fast path.
It's no different for passing RWF_HIPRI down to the hardware *in the
fast path* so that the IO runs the hardware in polling mode because
it's faster for some hardware.

Not particularly about this flag, but it is expensive. Surely looks
cheap when it's just one feature, but there are dozens of them with
limited applicability, default config kernels are already sluggish
when it comes to really fast devices and it's not getting better.
Also, pretty often every of them will add a bunch of extra checks
to fix something of whatever it would be.

So let's add a bit of pragmatism to the picture, if there is just one
user of a feature but it adds overhead for millions of machines that
won't ever use it, it's expensive.

Yup, you just described RWF_HIPRI! Seriously, Pavel, did you read
past this? I'll quote what I said again, because I've already
addressed this argument to point out how silly it is:

And you almost got to the initial point in your penult paragraph. A
single if for a single flag is not an issue, what is the problem is
when there are dozens of them and the overhead for it is not isolated,
so the kernel has to jump through dozens of those.

And just to be clear I'll outline again, that's a general problem,
I have no relation to the layers touched and it's up to relevant
people, obviously. Even though I'd expect but haven't found the flag
being rejected in other places, but well I may have missed something.


IOWs, saying that we shouldn't implement RWF_RECOVERY because it
adds a handful of branches to the fast path is like saying that we
shouldn't implement RWF_HIPRI because it slows down the fast path
for non-polled IO....

RWF_HIPRI functionality represents a *tiny* niche in the wider
Linux ecosystem, so by your reasoning it is too expensive to
implement because millions (billions!) of machines don't need or use
it. Do you now see how silly your argument is?

Seriously, this "optimise the IO fast path at the cost of everything
else" craziness has gotten out of hand. Nobody in the filesystem or
application world cares if you can do 10M IOPS per core when all the
CPU is doing is sitting in a tight loop inside the kernel repeatedly
overwriting data in the same memory buffers, essentially tossing the
old away the data without ever accessing it or doing anything with
it. Such speed racer games are *completely meaningless* as an
optimisation goal - it's what we've called "benchmarketing" for a
couple of decades now.

10M you mentioned is just a way to measure, there is nothing wrong
with it. And considering that there are enough of users considering
or already switching to spdk because of performance, the approach
is not wrong. And it goes not only for IO polling, normal irq IO
suffers from the same problems.

A related story is that this number is for a pretty reduced config,
it'll go down with a more default-ish kernel.

If all we focus on is bragging rights because "bigger number is
always better", then we'll end up with iand IO path that looks like
the awful mess that the fs/direct-io.c turned into. That ended up
being hyper-optimised for CPU performance right down to single
instructions and cacheline load orders that the code became
extremely fragile and completely unmaintainable.

We ended up *reimplementing the direct IO code from scratch* so that
XFS could build and submit direct IO smarter and faster because it
simply couldn't be done to the old code. That's how iomap came
about, and without *any optimisation at all* iomap was 20-30% faster
than the old, hyper-optimised fs/direct-io.c code. IOWs, we always
knew we could do direct IO faster than fs/direct-io.c, but we
couldn't make the fs/direct-io.c faster because of the
hyper-optimisation of the code paths made it impossible to modify
and maintain.> The current approach of hyper-optimising the IO path for maximum
per-core IOPS at the expensive of everything else has been proven in
the past to be exactly the wrong approach to be taking for IO path
development. Yes, we need to be concerned about performance and work
to improve it, but we should not be doing that at the cost of
everything else that the IO stack needs to be able to do.

And iomap is great, what you described is a good typical example
of unmaintainable code. I may get wrong what you exactly refer
to, but I don't see maintainability not being considered.

Even more interesting to notice that more often than not extra
features (and flags) almost always hurt maintainability of the
kernel, but then other benefits outweigh (hopefully).

Fundamentally, optimisation is something we do *after* we provide
the functionality that is required; using "fast path optimisation"
as a blunt force implement to prevent new features from being
implemented is just ... obnoxious.

This one doesn't spill yet into paths I care about, but in general
it'd be great if we start thinking more about such stuff instead of
throwing yet another if into the path, e.g. by shifting the overhead
from linear to a constant for cases that don't use it, for instance
with callbacks or bit masks.

This is orthogonal to providing data recovery functionality.
If the claims that flag propagation is too expensive are true, then
fixing this problem this will also improve RWF_HIPRI performance
regardless of whether RWF_DATA_RECOVERY exists or not...

IOWs, *if* there is a fast path performance degradation as a result
of flag propagation, then *go measure it* and show us how much
impact it has on _real world applications_. *Show us the numbers*
and document how much each additional flag propagation actually
costs so we can talk about whether it is acceptible, mitigation
strategies and/or alternative implementations. Flag propagation
overhead is just not a valid reason for preventing us adding new
flags to the IO path. Fix the flag propagation overhead if it's a
problem for you, don't use it as an excuse for preventing people
from adding new functionality that uses flag propagation...

--
Pavel Begunkov