Re: [RFC 1/1] ext4: fail fast on repeated metadata reads after IO failure

From: Diangang Li

Date: Thu Mar 26 2026 - 03:44:17 EST

Hi, Yi,

Thanks. Yes, for existing metadata blocks ext4 does read-modify-write,
so without a successful read (buffer Uptodate) there is no write path
that can update that block.

In the case we're seeing, the read keeps failing (repeated I/O errors on
the same LBA), so the write never has a chance to run either. Given
that, would it make sense (as Fengnan suggested) to treat persistent
media errors (e.g. MEDIUM ERROR / IO ERROR) as non-retryable at the
filesystem level, i.e. keep failing fast for that block? That would
avoid the BH_Lock thundering herd and prevent hung tasks.

Thanks,
Diangang

On 3/25/26 10:27 PM, Zhang Yi wrote:
> Hi, Diangang,
>
> On 3/25/2026 7:13 PM, Diangang Li wrote:
>> Hi Andreas,
>>
>> BH_Read_EIO is cleared on successful read or write.
>
> I think what Andreas means is, since you modified the ext4_read_bh()
> interface, if the bh to be read already has the Read_EIO flag set, then
> subsequent read operations through this interface will directly return
> failure without issuing a read I/O. At the same time, because its state
> is also not uptodate, for an existing block, a write request will not be
> issued either. How can we clear this Read_EIO flag? IIRC, relying solely
> on ext4_read_bh_nowait() doesn't seem sufficient to achieve this.
>
> Thanks,
> Yi.
>
>>
>> In practice bad blocks are typically repaired/remapped on write, so we
>> expect recovery after a successful rewrite. If the block is never
>> rewritten, repeatedly issuing the same failing read does not help.
>>
>> We clear the flag on successful reads so the buffer can recover
>> immediately if the error was transient. Since read-ahead reads are not
>> blocked, a later successful read-ahead will clear the flag and allow
>> subsequent synchronous readers to proceed normally.
>>
>> Best,
>> Diangang
>>
>> On 3/25/26 6:15 PM, Andreas Dilger wrote:
>>> On Mar 25, 2026, at 03:33, Diangang Li <diangangli@xxxxxxxxx> wrote:
>>>>
>>>> From: Diangang Li <lidiangang@xxxxxxxxxxxxx>
>>>>
>>>> ext4 metadata reads serialize on BH_Lock (lock_buffer). If the
>>>> read fails, the buffer remains !Uptodate. With concurrent callers,
>>>> each waiter can retry the same failing read after the previous
>>>> holder drops BH_Lock. This amplifies device retry latency and may
>>>> trigger hung tasks.
>>>>
>>>> In the normal read path the block driver already performs its own
>>>> retries. Once the retries keep failing, re-submitting the same
>>>> metadata read from the filesystem just amplifies the latency by
>>>> serializing waiters on BH_Lock.
>>>>
>>>> Remember read failures on buffer_head and fail fast for ext4
>>>> metadata reads once a buffer has already failed to read. Clear the
>>>> flag on successful read/write completion so the buffer can recover.
>>>> ext4 read-ahead uses ext4_read_bh_nowait(), so it does not set the
>>>> failure flag and remains best-effort.
>>>
>>> Not that the patch is bad, but if the BH_Read_EIO flag is set on a
>>> buffer and it prevents other tasks from reading that block again,
>>> how would the buffer ever become Uptodate to clear the flag? There
>>> isn't enough state in a 1-bit flag to have any kind of expiry and
>>> later retry.
>>>
>>> Cheers, Andreas
>>
>