Re: [PATCH/RFC] add "failfast" support for raid1/raid10.

From: Shaohua Li
Date: Mon Nov 21 2016 - 22:53:22 EST


On Fri, Nov 18, 2016 at 04:16:11PM +1100, Neil Brown wrote:
> Hi,
>
> I've been sitting on these patches for a while because although they
> solve a real problem, it is a fairly limited use-case, and I don't
> really like some of the details.
>
> So I'm posting them as RFC in the hope that a different perspective
> might help me like them better, or find a better approach.
>
> The core idea is that when you have multiple copies of data
> (i.e. mirrored drives) it doesn't make sense to wait for a read from
> a drive that seems to be having problems. It will probably be faster
> to just cancel that read, and read from the other device.
> Similarly, in some circumstances, it might be better to fail a drive
> that is being slow to respond to writes, rather than cause all writes
> to be very slow.
>
> The particular context where this comes up is when mirroring across
> storage arrays, where the storage arrays can temporarily take an
> unusually long time to respond to requests (firmware updates have
> been mentioned). As the array will have redundancy internally, there
> is little risk to the data. The mirrored pair is really only for
> disaster recovery, and it is deemed better to lose the last few
> minutes of updates in the case of a serious disaster, rather than
> occasionally having latency issues because one array needs to do some
> maintenance for a few minutes. The particular storage arrays in
> question are DASD devices which are part of the s390 ecosystem.
>
> The Linux block layer has "failfast" flags to direct drivers to fail more
> quickly. These patches allow devices in an md array to be given a
> "failfast" flag, which will cause IO requests to be marked as
> "failfast" provided there is another device available. Once the
> array becomes degraded, we stop using failfast, as that could result
> in data loss.
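
For concreteness, here is a minimal sketch of how I read the read-path
behaviour described above (names are approximate, not an excerpt from the
series; MD_FAILFAST here just stands for the combination of the two
block-layer flags md would use when a fall-back copy exists):

/* Sketch only: mark a mirrored read "failfast" when the admin has
 * flagged this device AND the array still has another up-to-date
 * copy to fall back on.  Once the array is degraded, giving up
 * early could mean losing data, so the flags stay clear.
 */
#define MD_FAILFAST	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT)

static void submit_mirror_read(struct mddev *mddev, struct md_rdev *rdev,
			       struct bio *read_bio)
{
	if (test_bit(FailFast, &rdev->flags) && mddev->degraded == 0)
		read_bio->bi_opf |= MD_FAILFAST;

	submit_bio(read_bio);
}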
>
> I don't like the whole "failfast" concept because it is not at all
> clear how fast "fast" is. In fact, these block-layer flags are
> really a misnomer. They should be "noretry" flags.
> REQ_FAILFAST_DEV means "don't retry requests which reported an error
> that seems to come from the device".
> REQ_FAILFAST_TRANSPORT means "don't retry requests which seem to
> indicate a problem with the transport, rather than the device".
> REQ_FAILFAST_DRIVER means .... I'm not exactly sure. I think it
> means whatever a particular driver wants it to mean, basically "I
> cannot seem to handle this right now, just resend and I'll probably
> be more in control next time". It seems to be for internal-use only.
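
As a quick reference, the same three flags with the readings above as
comments (the flag names are the real block-layer ones; the wording is
just a restatement of that paragraph, not the kernel's own documentation):

/*
 * "failfast" request flags -- really "no-retry" flags; none of them
 * shortens any timeout in the driver or the device:
 *
 *   REQ_FAILFAST_DEV       - don't retry a request whose error appears
 *                            to come from the device itself.
 *   REQ_FAILFAST_TRANSPORT - don't retry a request whose error appears
 *                            to come from the transport rather than
 *                            the device.
 *   REQ_FAILFAST_DRIVER    - driver-internal; roughly "don't retry in
 *                            the driver, just resend the request".
 */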
>
> Multipath code uses REQ_FAILFAST_TRANSPORT only, which makes sense.
> btrfs uses REQ_FAILFAST_DEV only (for read-ahead) which doesn't seem
> to make sense.... why would you ever use _DEV without _TRANSPORT?
>
> None of these actually change the timeouts in the driver or in the
> device, which is what I would expect for "failfast", so to get real
> "fast failure" you need to enable failfast, and adjust the timeouts.
> That is what we do for our customers with DASD.
>
> Anyway, it seems to make sense to use _TRANSPORT and _DEV for
> requests from md where there is somewhere to fall back on.
> If we get an error from a "failfast" request, and the array is still
> non-degraded, we just fail the device. We don't try to repair read
> errors (which is pointless on storage arrays).
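
A matching sketch of the failure handling described here, again
paraphrasing rather than quoting the patches (function and field names
are approximations; MD_FAILFAST is the mask from the earlier sketch):

/* Sketch only: a failfast read came back with an error.  If the
 * array still has another copy, fail the device outright instead of
 * retrying or running the usual read-error repair (rewriting the
 * block is pointless on a storage array).  User space can --re-add
 * the device later.
 */
static void failfast_read_error(struct mddev *mddev, struct md_rdev *rdev,
				struct bio *bio)
{
	if ((bio->bi_opf & MD_FAILFAST) && mddev->degraded == 0) {
		md_error(mddev, rdev);	/* kick the slow/failing device */
		return;
	}

	/* degraded, or not a failfast request: fall through to the
	 * normal retry-and-repair path so we never risk the last copy
	 */
}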
>
> It is assumed that some user-space code will notice the failure,
> monitor the device to see when it becomes available again, and then
> --re-add it. Assuming the array has a bitmap, the --re-add should be
> fast and the array will become optimal again without experiencing
> excessive latencies.
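
As a usage illustration of that recovery flow (device names here are made
up; the write-intent bitmap and --re-add are standard mdadm and are what
keep the resync after re-adding short):

  # array created with a write-intent bitmap, e.g.
  #   mdadm --create /dev/md0 --level=1 --raid-devices=2 \
  #         --bitmap=internal /dev/dasda1 /dev/dasdb1

  # once the temporarily-slow device is healthy again:
  mdadm /dev/md0 --re-add /dev/dasdb1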
>
> My two main concerns are:
> - does this functionality have any use-case outside of mirrored
> storage arrays, and are there other storage arrays which
> occasionally insert excessive latency (seems like a serious
> misfeature to me, but I know few of the details)?
> - would it be at all possible to have "real" failfast functionality
> in the block layer? I.e. something that is based on time rather
> than retry count. Maybe in some cases a retry would be
> appropriate if the first failure was very fast.
> I.e. it would reduce timeouts and decide on retries based on
> elapsed time rather than number of attempts.
> With this would come the question of "how fast is fast" and I
> don't have a really good answer. Maybe md would need to set a
> timeout, which it would double whenever it got failures on all
> drives. Otherwise the timeout would drift towards (say) 10 times
> the typical response time.
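
To make that last idea concrete, here is a self-contained, purely
hypothetical sketch of such a time-based policy (no existing md or
block-layer interface is assumed):

#include <stdbool.h>

/* Per-array adaptive read timeout: double it whenever every copy of a
 * read timed out (we were too aggressive), otherwise let it drift
 * towards roughly 10x the typical observed response time.
 */
struct adaptive_timeout {
	double timeout_ms;	/* current "give up and try elsewhere" limit */
	double typical_ms;	/* smoothed typical response time */
};

static void record_success(struct adaptive_timeout *t, double elapsed_ms)
{
	t->typical_ms = 0.9 * t->typical_ms + 0.1 * elapsed_ms;
	t->timeout_ms = 0.95 * t->timeout_ms + 0.05 * (10.0 * t->typical_ms);
}

static void record_all_copies_timed_out(struct adaptive_timeout *t)
{
	t->timeout_ms *= 2.0;
}

static bool should_redirect_read(const struct adaptive_timeout *t,
				 double elapsed_ms, bool have_other_copy)
{
	/* only give up early when there is somewhere else to read from */
	return have_other_copy && elapsed_ms > t->timeout_ms;
}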
>
> So: comments most welcome. As I say, this does address a genuine
> need. I just find it hard to like it :-(

The patches look good. As long as they don't break normal RAID arrays (and they
don't, provided the superblock bit isn't set), I have no objection to applying
them even though they are for a special use case. I'll add the series to the
next tree.

Just curious: will FAILFAST increase the chance of IO failure?

Thanks,
Shaohua