Re: [RFC 1/1] ext4: fail fast on repeated metadata reads after IO failure

From: Zhang Yi

Date: Thu Mar 26 2026 - 07:30:21 EST


On 3/26/2026 3:42 PM, Diangang Li wrote:
> Hi, Yi,
>
> Thanks. Yes, for existing metadata blocks ext4 is read-modify-write, so
> without a successful read (Uptodate) there is no write path to update
> that block.
>
> In the case we're seeing, the read keeps failing (repeated I/O errors on
> the same LBA), so the write never has a chance to run either. Given
> that, would it make sense (as Fengnan suggested) to treat persistent
> media errors (e.g. MEDIUM ERROR / IO ERROR) as non-retryable at the
> filesystem level, i.e. keep failing fast for that block? That would
> avoid the BH_Lock thundering herd and prevent hung tasks.
>

FYI, AFAICT, this approach makes sense in theory, but it runs into
problems with fault recovery, because these error codes are not always
reliable (especially BLK_STS_IOERR). Some storage devices (network
storage, for example) can return such errors for transient faults that
resolve themselves after a while, and where reliability requirements
are not very high, customers may not notice them right away. If ext4
keeps failing fast on that block after the device has recovered, the
only way to restore service is a heavy-weight operation such as
stopping services and remounting the file system. I believe there will
definitely be customers who complain about this.
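
To make the recovery concern concrete, here is a rough userspace model
of the trade-off. It is only a sketch under stated assumptions: it is
not kernel code and not the RFC patch, and every name in it (model_buf,
model_dev, model_read, retry_after, MODEL_EIO) is hypothetical. It also
ignores the read-ahead path that the RFC leaves unblocked. With
retry_after == 0 it behaves like a plain sticky 1-bit flag; with a
non-zero value it stores a timestamp so that a later read can retry, as
Andreas hinted at with "expiry and later retry".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_EIO 5 /* local stand-in for EIO; hypothetical constant */

/* Hypothetical model of a buffer; not struct buffer_head. */
struct model_buf {
	bool uptodate;
	bool read_eio;      /* sticky "last read failed" flag */
	uint64_t eio_stamp; /* assumption: tick of the last failed read */
};

/* Stand-in device: reads fail until tick heal_at, then succeed. */
struct model_dev {
	uint64_t heal_at;
	unsigned int ios_issued;
};

static bool dev_read(struct model_dev *d, uint64_t now)
{
	d->ios_issued++;
	return now >= d->heal_at;
}

/*
 * retry_after == 0: pure 1-bit flag, never retry once a read has failed.
 * retry_after  > 0: fail fast only until retry_after ticks have passed
 *                   since the last failure, then issue the read again.
 */
static int model_read(struct model_buf *b, struct model_dev *d,
		      uint64_t now, uint64_t retry_after)
{
	assert(b && d);

	if (b->uptodate)
		return 0;

	if (b->read_eio) {
		bool expired = retry_after != 0 &&
			       now - b->eio_stamp >= retry_after;
		if (!expired)
			return -MODEL_EIO; /* fail fast, no I/O issued */
	}

	if (dev_read(d, now)) {
		b->uptodate = true;
		b->read_eio = false;
		return 0;
	}

	b->read_eio = true;
	b->eio_stamp = now;
	return -MODEL_EIO;
}
```

In this model, a device that heals at tick 10 is never read again in
the retry_after == 0 case, however much later the caller comes back,
while with retry_after == 5 a read at tick 12 reaches the device and
succeeds. The cost of the second variant is extra state per buffer,
which is exactly what a single flag bit cannot hold.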

Thanks,
Yi.

> Thanks,
> Diangang
>
> On 3/25/26 10:27 PM, Zhang Yi wrote:
>> Hi, Diangang,
>>
>> On 3/25/2026 7:13 PM, Diangang Li wrote:
>>> Hi Andreas,
>>>
>>> BH_Read_EIO is cleared on successful read or write.
>>
>> I think what Andreas means is, since you modified the ext4_read_bh()
>> interface, if the bh to be read already has the Read_EIO flag set, then
>> subsequent read operations through this interface will directly return
>> failure without issuing a read I/O. At the same time, because its state
>> is also not uptodate, for an existing block, a write request will not be
>> issued either. How can we clear this Read_EIO flag? IIRC, relying solely
>> on ext4_read_bh_nowait() doesn't seem sufficient to achieve this.
>>
>> Thanks,
>> Yi.
>>
>>>
>>> In practice bad blocks are typically repaired/remapped on write, so we
>>> expect recovery after a successful rewrite. If the block is never
>>> rewritten, repeatedly issuing the same failing read does not help.
>>>
>>> We clear the flag on successful reads so the buffer can recover
>>> immediately if the error was transient. Since read-ahead reads are not
>>> blocked, a later successful read-ahead will clear the flag and allow
>>> subsequent synchronous readers to proceed normally.
>>>
>>> Best,
>>> Diangang
>>>
>>> On 3/25/26 6:15 PM, Andreas Dilger wrote:
>>>> On Mar 25, 2026, at 03:33, Diangang Li <diangangli@xxxxxxxxx> wrote:
>>>>>
>>>>> From: Diangang Li <lidiangang@xxxxxxxxxxxxx>
>>>>>
>>>>> ext4 metadata reads serialize on BH_Lock (lock_buffer). If the read fails,
>>>>> the buffer remains !Uptodate. With concurrent callers, each waiter can
>>>>> retry the same failing read after the previous holder drops BH_Lock. This
>>>>> amplifies device retry latency and may trigger hung tasks.
>>>>>
>>>>> In the normal read path the block driver already performs its own retries.
>>>>> Once the retries keep failing, re-submitting the same metadata read from
>>>>> the filesystem just amplifies the latency by serializing waiters on
>>>>> BH_Lock.
>>>>>
>>>>> Remember read failures on buffer_head and fail fast for ext4 metadata reads
>>>>> once a buffer has already failed to read. Clear the flag on successful
>>>>> read/write completion so the buffer can recover. ext4 read-ahead uses
>>>>> ext4_read_bh_nowait(), so it does not set the failure flag and remains
>>>>> best-effort.
>>>>
>>>> Not that the patch is bad, but if the BH_Read_EIO flag is set on a buffer
>>>> and it prevents other tasks from reading that block again, how would the
>>>> buffer ever become Uptodate to clear the flag?  There isn't enough state
>>>> in a 1-bit flag to have any kind of expiry and later retry.
>>>>
>>>> Cheers, Andreas
>>>
>>
>
>