Re: [PATCH 1/1] bcache: fix stale data race between read cache miss and bypass write

From: Ankit Kapoor

Date: Wed May 27 2026 - 09:48:15 EST

Hi Coly,

Thank you for the feedback, for confirming the issue, and for the guidance.

> Hi Ankit,
>
> Yes, I confirm this is an issue that must be solved. Nice catch!
>
> On Thu, May 21, 2026 at 04:39:25PM +0800, Ankit Kapoor wrote:
>> A race condition exists between a read cache miss and a bypass write
>> due to either congestion or sequential bypass, that causes stale data
>> to be cached when the read cache miss runs concurrently with a bypass
>> write targeting the same sectors.
>
> This patch fixes the stale data issue in run time, but if power failure
> happens inside the race window, after boot up again, the stale data
> still exists in cache for following read hits.
>
> And your fix invalidates the key after the on-disk bio has completed,
> which makes the stale data window opened by a power failure longer.

I had initially hoped that serializing the operations would suffice, but
I completely agree that the power-failure case must be addressed as
well.

> To solve all the stale data race both for run time and power failure
> condition, could you please consider the following proposal.
>
> Maintain a data structure to hold all invalidation ranges from bypass
> writes: record/insert the invalidation range before bch_data_insert(),
> and clear/remove the invalidation range after
> cached_dev_write_complete().
>
> For a cache-miss read, if any invalidation range has a non-zero
> refcount, check all such ranges. If any of them overlaps with the
> cache-miss read range, do NOT insert the missing bkey back into the
> btree, and only read the data from the backing device.

I am now working on a new implementation that tracks the sector ranges
of in-flight bypass writes, as you suggest here.

> Here you need to design an efficient data structure both for performance
> and memory consumption. I would suggest maintaining chunk refcounts
> which map multiple 32MB ranges on the cache device (the current max key
> size, if I remember correctly). You may look at how md raid maintains
> the legacy bitmap refcount; I hope that code can give you some hints.

Thanks, I will look into how md raid maintains the legacy bitmap
refcounts for hints. In the meantime, could you please recommend any
specific fio configurations or workloads you prefer for evaluating the
memory overhead and performance impact of this change?
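
To make the chunk refcount idea concrete, below is a minimal userspace
sketch of the scheme as I read it. It is only an illustration, not the v2
code: every name in it (bypass_tracker, bypass_write_begin(),
read_miss_may_insert(), and so on) is hypothetical and does not exist in
bcache, the 32MB chunk size is taken from the suggestion above, and a flat
array indexed by chunk number is assumed purely for simplicity. Locking
and persistence across power failure are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* All names are hypothetical; nothing here is existing bcache code. */
#define CHUNK_SHIFT 16	/* 2^16 sectors * 512 B = 32 MiB per chunk */

struct bypass_tracker {
	uint64_t nr_chunks;
	uint32_t *refs;	/* in-flight bypass writes touching each chunk */
};

static int bypass_tracker_init(struct bypass_tracker *t, uint64_t dev_sectors)
{
	/* Round up so a partial last chunk still gets a counter. */
	t->nr_chunks = (dev_sectors + (1ULL << CHUNK_SHIFT) - 1) >> CHUNK_SHIFT;
	t->refs = calloc(t->nr_chunks, sizeof(*t->refs));
	return t->refs ? 0 : -1;
}

static void bypass_tracker_free(struct bypass_tracker *t)
{
	free(t->refs);
	t->refs = NULL;
}

/* Add delta to the refcount of every chunk touched by [sector, sector + nr). */
static void bypass_adjust(struct bypass_tracker *t, uint64_t sector,
			  uint64_t nr, int delta)
{
	uint64_t c, last;

	assert(nr > 0);
	last = (sector + nr - 1) >> CHUNK_SHIFT;
	for (c = sector >> CHUNK_SHIFT; c <= last; c++) {
		assert(c < t->nr_chunks);
		assert(delta > 0 || t->refs[c] > 0);	/* no underflow */
		t->refs[c] += delta;
	}
}

/* Would be called before the bypass write is issued. */
static void bypass_write_begin(struct bypass_tracker *t, uint64_t sector,
			       uint64_t nr)
{
	bypass_adjust(t, sector, nr, 1);
}

/* Would be called once the bypass write has completed. */
static void bypass_write_end(struct bypass_tracker *t, uint64_t sector,
			     uint64_t nr)
{
	bypass_adjust(t, sector, nr, -1);
}

/*
 * Returns false if any chunk covering the read range has an in-flight
 * bypass write; the read miss should then skip the cache insert and
 * serve the data from the backing device only.
 */
static bool read_miss_may_insert(const struct bypass_tracker *t,
				 uint64_t sector, uint64_t nr)
{
	uint64_t c, last;

	assert(nr > 0);
	last = (sector + nr - 1) >> CHUNK_SHIFT;
	for (c = sector >> CHUNK_SHIFT; c <= last && c < t->nr_chunks; c++)
		if (t->refs[c])
			return false;
	return true;
}
```

The check is conservative: a read miss anywhere in a chunk with an
in-flight bypass write skips the cache insert, even if the sectors do not
actually overlap. That trades some cache fill for a fixed 4 bytes of
memory per 32MB chunk in this sketch.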

I will send a v2 patch as soon as the tracking mechanism is ready and
thoroughly tested.

Best regards,
Ankit Kapoor