Re: [RFC 1/1] ext4: fail fast on repeated metadata reads after IO failure

From: Matthew Wilcox

Date: Wed Mar 25 2026 - 11:29:27 EST

On Wed, Mar 25, 2026 at 04:15:42AM -0600, Andreas Dilger wrote:
> On Mar 25, 2026, at 03:33, Diangang Li <diangangli@xxxxxxxxx> wrote:
> >
> > From: Diangang Li <lidiangang@xxxxxxxxxxxxx>
> >
> > ext4 metadata reads serialize on BH_Lock (lock_buffer). If the read fails,
> > the buffer remains !Uptodate. With concurrent callers, each waiter can
> > retry the same failing read after the previous holder drops BH_Lock. This
> > amplifies device retry latency and may trigger hung tasks.
> >
> > In the normal read path the block driver already performs its own retries.
> > Once the retries keep failing, re-submitting the same metadata read from
> > the filesystem just amplifies the latency by serializing waiters on
> > BH_Lock.
> >
> > Remember read failures on buffer_head and fail fast for ext4 metadata reads
> > once a buffer has already failed to read. Clear the flag on successful
> > read/write completion so the buffer can recover. ext4 read-ahead uses
> > ext4_read_bh_nowait(), so it does not set the failure flag and remains
> > best-effort.
>
> Not that the patch is bad, but if the BH_Read_EIO flag is set on a buffer
> and it prevents other tasks from reading that block again, how would the
> buffer ever become Uptodate to clear the flag? There isn't enough state
> in a 1-bit flag to have any kind of expiry and later retry.

I've been thinking about this problem too, albeit from a folio read
perspective, not from a buffer_head read perspective. You're quite
right that one bit isn't enough. The solution I was considering but
haven't implemented yet was to tell all the current waiters that
the IO has failed, but not set any kind of permanent error flag.

I was thinking about starting with this:

+++ b/include/linux/wait_bit.h
@@ -10,6 +10,7 @@
 struct wait_bit_key {
 	unsigned long *flags;
 	int bit_nr;
+	int error;
 	unsigned long timeout;
 };

and then adding/changing various APIs to allow an error to be passed in
and noticed by the woken task.

With this change, the thundering herd all wake up, see the error and
return immediately instead of each submitting their own I/O. Reads that
arrive later will still retry, but each one is held up for at most its
own timeout.
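
As a rough illustration of that behaviour, here is a userspace sketch
using pthreads. It is not kernel code and does not use the wait-bit API.
Every sim_* name is hypothetical, and hold_until is only a test knob that
keeps the simulated I/O in flight until the whole herd has queued up. The
first reader submits the I/O, the readers that pile up behind it are woken
with its error and return without resubmitting, and nothing sticky is left
on the buffer, so a later reader submits a fresh I/O.

```c
/* Userspace simulation only: none of these names exist in the kernel. */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_MAX_THREADS 64

struct sim_buffer {
	pthread_mutex_t lock;
	pthread_cond_t done;      /* broadcast when an I/O completes */
	pthread_cond_t queued;    /* signalled when a waiter queues up */
	bool in_flight;           /* a read is currently outstanding */
	unsigned long generation; /* bumped once per completed I/O */
	int error;                /* result of the last completed I/O */
	int submitted;            /* total I/Os sent to the "device" */
	int waiters;              /* tasks waiting on the current I/O */
	int hold_until;           /* test knob: complete once this many wait */
};

static void sim_buffer_init(struct sim_buffer *b)
{
	memset(b, 0, sizeof(*b));
	pthread_mutex_init(&b->lock, NULL);
	pthread_cond_init(&b->done, NULL);
	pthread_cond_init(&b->queued, NULL);
}

/* Every read of this buffer fails with -EIO, as on a bad sector. */
static int sim_read(struct sim_buffer *b)
{
	int err;

	pthread_mutex_lock(&b->lock);
	if (b->in_flight) {
		/* Waiter: sleep until the in-flight I/O completes, then take
		 * its result instead of submitting another read. */
		unsigned long gen = b->generation;

		b->waiters++;
		pthread_cond_signal(&b->queued);
		while (b->generation == gen)
			pthread_cond_wait(&b->done, &b->lock);
		err = b->error;
		pthread_mutex_unlock(&b->lock);
		return err;
	}

	/* Submitter: the only task that talks to the "device". */
	b->in_flight = true;
	b->submitted++;
	while (b->waiters < b->hold_until)	/* pretend the device is slow */
		pthread_cond_wait(&b->queued, &b->lock);

	/* Completion: report the error to the waiters, keep no sticky flag. */
	b->error = -EIO;
	b->waiters = 0;
	b->in_flight = false;
	b->generation++;
	pthread_cond_broadcast(&b->done);
	pthread_mutex_unlock(&b->lock);
	return -EIO;
}

static void *sim_reader(void *arg)
{
	return (void *)(intptr_t)sim_read(arg);
}

/* Run nthreads concurrent readers; return how many of them saw -EIO. */
static int sim_herd(struct sim_buffer *b, int nthreads)
{
	pthread_t tid[SIM_MAX_THREADS];
	int i, started = 0, failed = 0;

	assert(nthreads >= 1 && nthreads <= SIM_MAX_THREADS);
	b->hold_until = nthreads - 1;	/* no readers are running yet */
	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tid[i], NULL, sim_reader, b) != 0)
			break;
		started++;
	}
	pthread_mutex_lock(&b->lock);
	b->hold_until = started - 1;	/* same value unless a create failed */
	pthread_cond_signal(&b->queued);
	pthread_mutex_unlock(&b->lock);
	for (i = 0; i < started; i++) {
		void *ret;

		pthread_join(tid[i], &ret);
		if ((intptr_t)ret == -EIO)
			failed++;
	}
	b->hold_until = 0;		/* later reads complete immediately */
	return failed;
}
```

After sim_buffer_init(&b), a call to sim_herd(&b, 8) should return 8 with
b.submitted left at 1, and a following sim_read(&b) submits a second I/O.
The simulation reads the error from shared state after wake-up; the
proposed error field in struct wait_bit_key is meant to hand it to each
woken task directly.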