Re: [PATCH 1/1] bcache: fix stale data race between read cache miss and bypass write

From: Coly Li

Date: Wed May 27 2026 - 11:30:26 EST

> On May 27, 2026, at 21:41, Ankit Kapoor <ankitkap@xxxxxxxxxx> wrote:
>
> Hi Coly,
>
> Thank you for the feedback, for confirming the issue, and for the guidance.
>
>> Hi Ankit,
>>
>> Yes, I confirm this is an issue that must be solved. Nice catch!
>>
>> On Thu, May 21, 2026 at 04:39:25PM +0800, Ankit Kapoor wrote:
>>> A race condition exists between a read cache miss and a bypass write
>>> due to either congestion or sequential bypass, that causes stale data
>>> to be cached when the read cache miss runs concurrently with a bypass
>>> write targeting the same sectors.
>>
>> This patch fixes the stale data issue at run time, but if a power failure
>> happens inside the race window, the stale data still exists in the cache
>> after the next boot and is served by subsequent read hits.
>>
>> And your fix invalidates the key after the on-disk bio completes, which
>> makes this power-failure stale data window longer.
>
> While I initially hoped that serializing the operations would suffice, I
> completely agree with your point regarding the power-failure risk,
> which must be addressed.
>
>> To solve the stale data race for both the run-time and power-failure
>> conditions, could you please consider the following proposal.
>>
>> Maintain a data structure that holds all invalidation ranges from bypass
>> writes: record/insert the invalidation range before bch_data_insert(),
>> and clear/remove it after cached_dev_write_complete().
>>
>> For a cache-miss read, if any invalidation range has a non-zero
>> refcount, check all such ranges; if any of them overlaps the
>> cache-miss read range, do NOT insert the missing bkey into the btree
>> and only read the data from the backing device.
>
> I am now working on a new implementation to track the in-flight
> sectors currently being written, exactly as you suggested here.
>
>> Here you need to design a data structure that is efficient in both
>> performance and memory consumption. I would suggest maintaining chunk
>> refcounts that map multiple 32MB ranges on the cache device (the current
>> max key size, if I remember correctly). You may look at how md raid
>> maintains the legacy bitmap refcounts; I hope that code gives you a hint.
>
> Thanks, I will look into the md raid legacy bitmap reference implementation for
> hints. In the meantime, could you please recommend any specific fio
> configurations or workloads you prefer for evaluating the memory
> overhead and performance impact of this change?

Maybe you can use a large, fast SSD as the backing device and do fully random I/O in writearound mode.
Then try to set up the race window so that the in-memory refcounts occupy more memory.

I don’t suggest using a tree-like structure. Just use a refcount to cover each 32MB range on the backing device; it can be faster.
If a cache-miss read overlaps a refcount-covered range, change it to read-without-refill-cache. To keep the refcounts
from occupying too much memory, if a page’s refcounts are all zero, you may think about releasing that page. This is what I meant
when I mentioned how md bitmap manages its pages of bits. Maybe the idea helps a little bit.
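
To make the idea above concrete, here is a minimal userspace sketch, not bcache or md code. All names (bypass_map, bypass_get(), bypass_put(), bypass_overlaps()) are hypothetical, and it assumes 512-byte sectors, 4 KiB counter pages, 16-bit counters, and no locking; a kernel version would need proper locking or atomics and the kernel allocators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SHIFT  16u    /* 2^16 sectors * 512 B = 32 MiB per chunk */
#define PAGE_BYTES   4096u  /* assumed size of one counter page */
#define CNT_PER_PAGE (PAGE_BYTES / sizeof(uint16_t))

struct bypass_page {
	uint16_t *cnt;     /* NULL while every counter in this page is zero */
	uint32_t nonzero;  /* number of non-zero counters in cnt[] */
};

struct bypass_map {
	struct bypass_page *pages;
	size_t nr_pages;
};

static inline uint64_t chunk_of(uint64_t sector)
{
	return sector >> CHUNK_SHIFT;
}

/* Size the page table for a device of dev_sectors; no counter page yet. */
int bypass_map_init(struct bypass_map *m, uint64_t dev_sectors)
{
	uint64_t chunks = (dev_sectors + (1ull << CHUNK_SHIFT) - 1) >> CHUNK_SHIFT;

	m->nr_pages = (size_t)((chunks + CNT_PER_PAGE - 1) / CNT_PER_PAGE);
	m->pages = calloc(m->nr_pages, sizeof(*m->pages));
	return m->pages ? 0 : -1;
}

/* Drop one reference on chunks first..last; free pages that become all zero. */
void bypass_put_chunks(struct bypass_map *m, uint64_t first, uint64_t last)
{
	for (uint64_t c = first; c <= last; c++) {
		struct bypass_page *p = &m->pages[c / CNT_PER_PAGE];

		assert(p->cnt && p->cnt[c % CNT_PER_PAGE] > 0);
		if (--p->cnt[c % CNT_PER_PAGE] == 0 && --p->nonzero == 0) {
			free(p->cnt);
			p->cnt = NULL;
		}
	}
}

/* Call before a bypass write is issued (nr > 0). False on allocation failure. */
bool bypass_get(struct bypass_map *m, uint64_t sector, uint32_t nr)
{
	uint64_t first = chunk_of(sector), last = chunk_of(sector + nr - 1);

	for (uint64_t c = first; c <= last; c++) {
		struct bypass_page *p = &m->pages[c / CNT_PER_PAGE];

		if (!p->cnt && !(p->cnt = calloc(CNT_PER_PAGE, sizeof(*p->cnt)))) {
			if (c > first)  /* undo the references taken so far */
				bypass_put_chunks(m, first, c - 1);
			return false;
		}
		assert(p->cnt[c % CNT_PER_PAGE] < UINT16_MAX);
		if (p->cnt[c % CNT_PER_PAGE]++ == 0)
			p->nonzero++;
	}
	return true;
}

/* Call after the bypass write has completed, with the same range. */
void bypass_put(struct bypass_map *m, uint64_t sector, uint32_t nr)
{
	bypass_put_chunks(m, chunk_of(sector), chunk_of(sector + nr - 1));
}

/* True if a cache-miss read of this range should skip refilling the cache. */
bool bypass_overlaps(const struct bypass_map *m, uint64_t sector, uint32_t nr)
{
	uint64_t first = chunk_of(sector), last = chunk_of(sector + nr - 1);

	for (uint64_t c = first; c <= last; c++) {
		const struct bypass_page *p = &m->pages[c / CNT_PER_PAGE];

		if (p->cnt && p->cnt[c % CNT_PER_PAGE])
			return true;
	}
	return false;
}
```

Note that the check works at chunk granularity, so a read that shares a 32MB chunk with an in-flight bypass write is treated as overlapping even when the sectors differ. That only costs one skipped cache refill, and it keeps the lookup a couple of array indexes instead of a tree walk.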

>
> I will send a v2 patch series as soon as the tracking mechanism is ready
> and thoroughly tested.

Thank you for catching this issue and working on the fix.

Coly Li