[PATCH/RFC] add "failfast" support for raid1/raid10.

From: NeilBrown
Date: Fri Nov 18 2016 - 00:17:13 EST


Hi,

I've been sitting on these patches for a while because although they
solve a real problem, it is a fairly limited use-case, and I don't
really like some of the details.

So I'm posting them as RFC in the hope that a different perspective
might help me like them better, or find a better approach.

The core idea is that when you have multiple copies of data
(i.e. mirrored drives) it doesn't make sense to wait for a read from
a drive that seems to be having problems. It will probably be faster
to just cancel that read, and read from the other device.
Similarly, in some circumstances, it might be better to fail a drive
that is being slow to respond to writes, rather than cause all writes
to be very slow.

The particular context where this comes up is when mirroring across
storage arrays, where the storage arrays can temporarily take an
unusually long time to respond to requests (firmware updates have
been mentioned). As the array will have redundancy internally, there
is little risk to the data. The mirrored pair is really only for
disaster recovery, and it is deemed better to lose the last few
minutes of updates in the case of a serious disaster, rather than
occasionally having latency issues because one array needs to do some
maintenance for a few minutes. The particular storage arrays in
question are DASD devices which are part of the s390 ecosystem.

The Linux block layer has "failfast" flags to direct drivers to fail
more quickly. These patches allow devices in an md array to be given a
"failfast" flag, which will cause IO requests to be marked as
"failfast" provided there is another device available. Once the
array becomes degraded, we stop using failfast, as that could result
in data loss.
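
As a rough illustration of that rule, here is a minimal C sketch. The
struct and function names are hypothetical and are not taken from the
patches; it only shows a per-device flag being honoured while the array
is non-degraded and ignored once it is degraded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a mirror set; not the md structures. */
struct mirror_state {
    int configured_devs;  /* devices the array is meant to have */
    int working_devs;     /* devices currently usable */
};

/* True once at least one configured device is missing. */
static bool array_is_degraded(const struct mirror_state *m)
{
    assert(m->working_devs >= 0 && m->working_devs <= m->configured_devs);
    return m->working_devs < m->configured_devs;
}

/*
 * Decide whether a request to one device may be marked failfast.
 * dev_failfast is the per-device flag set by the administrator.
 * Once the array is degraded there is no safe fallback, so the
 * marking is dropped even if the device flag is set.
 */
static bool may_use_failfast(const struct mirror_state *m, bool dev_failfast)
{
    return dev_failfast && !array_is_degraded(m);
}
```

With a two-device mirror, may_use_failfast() is true only while both
devices are working and the device has the flag set.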

I don't like the whole "failfast" concept because it is not at all
clear how fast "fast" is. In fact, these block-layer flags are
really a misnomer. They should be "noretry" flags.
REQ_FAILFAST_DEV means "don't retry requests which reported an error
which seems to come from the device".
REQ_FAILFAST_TRANSPORT means "don't retry requests which seem to
indicate a problem with the transport, rather than the device".
REQ_FAILFAST_DRIVER means .... I'm not exactly sure. I think it
means whatever a particular driver wants it to mean, basically "I
cannot seem to handle this right now, just resend and I'll probably
be more in control next time". It seems to be for internal use only.
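
To make the "noretry" reading concrete, here is a minimal C sketch.
The bit values, the error classification and the function names are
made up for illustration and are not the kernel's definitions; only
the flag meanings follow the description above, and the _DRIVER case
is as uncertain here as it is there. Note that nothing in it involves
time.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; the kernel defines the real REQ_FAILFAST_* bits. */
#define NORETRY_DEV        (1u << 0)
#define NORETRY_TRANSPORT  (1u << 1)
#define NORETRY_DRIVER     (1u << 2)

/* Hypothetical classification of where a failure appears to originate. */
enum err_source { ERR_FROM_DEVICE, ERR_FROM_TRANSPORT, ERR_FROM_DRIVER };

/* Map an error source to the flag that suppresses retries for it. */
static unsigned int noretry_bit(enum err_source src)
{
    switch (src) {
    case ERR_FROM_DEVICE:    return NORETRY_DEV;
    case ERR_FROM_TRANSPORT: return NORETRY_TRANSPORT;
    case ERR_FROM_DRIVER:    return NORETRY_DRIVER;
    }
    assert(!"unknown error source");
    return 0;
}

/*
 * A failed request is retried only if the flag matching its error
 * source is clear and the retry budget is not exhausted. The flags
 * only remove retries; they do not shorten any timeout.
 */
static bool should_retry(unsigned int flags, enum err_source src,
                         int retries_done, int max_retries)
{
    if (flags & noretry_bit(src))
        return false;
    return retries_done < max_retries;
}
```

In this model a request flagged only with NORETRY_DEV is still retried
after a transport error.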

Multipath code uses REQ_FAILFAST_TRANSPORT only, which makes sense.
btrfs uses REQ_FAILFAST_DEV only (for read-ahead) which doesn't seem
to make sense.... why would you ever use _DEV without _TRANSPORT?

None of these actually change the timeouts in the driver or in the
device, which is what I would expect for "failfast", so to get real
"fast failure" you need to enable failfast, and adjust the timeouts.
That is what we do for our customers with DASD.

Anyway, it seems to make sense to use _TRANSPORT and _DEV for
requests from md where there is somewhere to fall back on.
If we get an error from a "failfast" request, and the array is still
non-degraded, we just fail the device. We don't try to repair read
errors (which is pointless on storage arrays).
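
A minimal C sketch of that error policy follows. All names and bit
values are hypothetical. The text above only says what happens when a
failfast request fails on a non-degraded array; the other branch is an
assumption, shown simply as "use the normal error handling".

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values, not the kernel's. */
#define SK_FAILFAST_DEV        (1u << 0)
#define SK_FAILFAST_TRANSPORT  (1u << 1)

/* The combination used for md requests that have a fallback device. */
#define SK_MD_FAILFAST  (SK_FAILFAST_DEV | SK_FAILFAST_TRANSPORT)

enum md_action {
    ACT_FAIL_DEVICE,      /* mark the device faulty, no read-error repair */
    ACT_NORMAL_HANDLING,  /* assumption: whatever md would otherwise do */
};

/* Choose what to do when a request to one device reports an error. */
static enum md_action md_on_error(unsigned int flags, bool degraded)
{
    bool was_failfast;

    /* This sketch only knows about the two bits defined above. */
    assert((flags & ~SK_MD_FAILFAST) == 0);

    was_failfast = (flags & SK_MD_FAILFAST) != 0;
    if (was_failfast && !degraded)
        return ACT_FAIL_DEVICE;
    return ACT_NORMAL_HANDLING;
}
```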

It is assumed that some user-space code will notice the failure,
monitor the device to see when it becomes available again, and then
--re-add it. Assuming the array has a bitmap, the --re-add should be
fast and the array will become optimal again without experiencing
excessive latencies.

My two main concerns are:
 - does this functionality have any use-case outside of mirrored
   storage arrays, and are there other storage arrays which
   occasionally insert excessive latency (seems like a serious
   misfeature to me, but I know few of the details)?
 - would it be at all possible to have "real" failfast functionality
   in the block layer? I.e. something that is based on time rather
   than retry count. Maybe in some cases a retry would be
   appropriate if the first failure was very fast.
   I.e. it would reduce timeouts and decide on retries based on
   elapsed time rather than number of attempts.
   With this would come the question of "how fast is fast" and I
   don't have a really good answer. Maybe md would need to set a
   timeout, which it would double whenever it got failures on all
   drives. Otherwise the timeout would drift towards (say) 10 times
   the typical response time.
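
The timeout idea in the second point could look roughly like the C
sketch below. Nothing like this exists in the patches. The names, the
smoothing factor and the use of floating point (which kernel code
would avoid) are all assumptions made to keep the illustration short;
only the doubling and the "10 times typical" target come from the text.

```c
#include <assert.h>

/* Hypothetical per-array state for a time-based failfast deadline. */
struct ff_timeout {
    double timeout_ms;  /* current deadline for a failfast request */
    double typical_ms;  /* running estimate of a normal response time */
};

#define FF_TARGET_MULTIPLE 10.0   /* the "(say) 10 times" from the text */
#define FF_SMOOTHING       0.125  /* assumed drift rate per completion */

/* A request completed normally after elapsed_ms. */
static void ff_on_success(struct ff_timeout *t, double elapsed_ms)
{
    double target;

    assert(elapsed_ms >= 0.0);

    /* Update the estimate of the typical response time. */
    t->typical_ms += FF_SMOOTHING * (elapsed_ms - t->typical_ms);

    /* Let the deadline drift towards 10x the typical response time. */
    target = FF_TARGET_MULTIPLE * t->typical_ms;
    t->timeout_ms += FF_SMOOTHING * (target - t->timeout_ms);
}

/* Every drive missed the deadline, so the deadline was too short. */
static void ff_on_all_failed(struct ff_timeout *t)
{
    t->timeout_ms *= 2.0;
}
```

Starting from a 100 ms deadline and a 10 ms typical response, one
failure on all drives doubles the deadline to 200 ms, and later
successful requests pull it back towards 100 ms.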

So: comments most welcome. As I say, this does address a genuine
need. I just find it hard to like it :-(


Thanks,
NeilBrown

---

NeilBrown (6):
      md/failfast: add failfast flag for md to be used by some personalities.
      md: Use REQ_FAILFAST_* on metadata writes where appropriate
      md/raid1: add failfast handling for reads.
      md/raid1: add failfast handling for writes.
      md/raid10: add failfast handling for reads.
      md/raid10: add failfast handling for writes.


 drivers/md/bitmap.c            | 15 ++++++--
 drivers/md/md.c                | 71 +++++++++++++++++++++++++++++++-----
 drivers/md/md.h                | 27 +++++++++++++-
 drivers/md/raid1.c             | 79 ++++++++++++++++++++++++++++++++++------
 drivers/md/raid1.h             |  1 +
 drivers/md/raid10.c            | 79 +++++++++++++++++++++++++++++++++++++---
 drivers/md/raid10.h            |  2 +
 include/uapi/linux/raid/md_p.h |  7 +++-
 8 files changed, 249 insertions(+), 32 deletions(-)
