Re: [PATCH 0/1] bcache: fix stale data race between read cache miss and bypass write
From: Coly Li
Date: Sun May 24 2026 - 12:20:39 EST
On Thu, May 21, 2026 at 04:39:24PM +0800, Ankit Kapoor wrote:
Hi Ankit,
From your description and analysis, I feel this is a real issue.
Let me understand this more deeply and respond to you later.
Thanks.
Coly Li
> Overview
> --------
> This series addresses a cache inconsistency issue with stale data in bcache
> that arises from a race condition between a read cache miss and a bypass
> write due to congestion or sequential cutoff. The fix involves sequencing
> the btree invalidation of the bypass write to occur strictly after the
> backing device write.
>
> Race Analysis
> -------------
> The following sequence illustrates how stale data is cached after a read
> cache miss when btree invalidation of a bypass write happens in parallel
> with a delayed write to the backing device:
>
>    Write IO Path (Parallel)                 Read IO Path
>    ------------------------                 ------------
>               |
>     [Btree Invalidation]
>               |
>               |                             [Cache Miss]
>               |                                   |
>               |                   [Btree Placeholder Key Insertion]
>               |                                   |
>       (Delay in writing                           |
>    to the backing device)                         |
>               |                 [Cache data from the backing device]
>               |                                   |
>               +---------------------------------->|  <-- No key collision detected!
>               |                  [Btree Placeholder Key Replacement]
>               |                                   |
>         [Write to the                             |
>        backing device]                      -------------
>                                             CRITICAL BUG:
>                                        Stale data gets cached
>
> Reproduction Steps
> ------------------
> The bug can be reliably reproduced by injecting a 5-second delay into
> the backing device write path via dm-delay. Cache mode is set to
> writearound to simulate bypass write.
>
> 1. Data Preparation:
> # printf -- '%.0s\0' {1..4096} > /tmp/0.txt
> # printf -- '%.0s\1' {1..4096} > /tmp/1.txt
> # echo writearound > /sys/block/bcache0/bcache/cache_mode
> # dd if=/tmp/0.txt of=/media/bcache/data.txt oflag=direct \
> bs=4096 count=1 conv=notrunc
>
> 2. Race Execution:
> # dd if=/tmp/1.txt of=/media/bcache/data.txt oflag=direct \
> bs=4096 count=1 conv=notrunc &
> # sleep 1
> # dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
> status=none | hexdump > ./concurrent-read-result
> # sleep 10
> # dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 \
> status=none | hexdump > ./second-read-result
>
> 3. Results (Without Patch):
> # cat second-read-result
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000 # <--- STALE READ
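The stale/fresh check in step 3 can also be scripted instead of read off the hexdump. A minimal sketch, assuming the 0x00 and 0x01 fill patterns from step 1; the `classify_block` helper is hypothetical and is not part of the patch or of bcache:

```shell
#!/usr/bin/env bash
# classify_block: read one block on stdin and report which fill pattern it holds.
# Hypothetical helper for illustration only.
classify_block() {
    local distinct
    # od prints every byte as an unsigned decimal. Splitting on spaces,
    # dropping empty lines, and de-duplicating leaves exactly one value
    # when the block is uniformly filled.
    distinct=$(od -An -v -tu1 | tr -s ' ' '\n' | sed '/^$/d' | sort -u)
    case "$distinct" in
        0) echo stale ;;   # still the zeros written in step 1
        1) echo fresh ;;   # the ones written in step 2 are visible
        *) echo other ;;   # empty, mixed, or unexpected content
    esac
}
```

It can be fed from the same direct read used above, for example `dd if=/media/bcache/data.txt iflag=direct bs=4096 count=1 status=none | classify_block`. For the unpatched result shown in step 3 it would report `stale`.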
>
> Proposed Fix
> ------------
> The fix enforces a strict sequential order within a bypass write: btree
> invalidation starts only after the write to the backing device completes:
>
>                OLD FLOW                               NEW FLOW
>     -------------------------------       --------------------------------
>             [ Write Start ]                        [ Write Start ]
>                    |                                      |
>          +---------+---------+                            |
>          |                   |                            v
>          v                   v                      [ Write to ]
>      [ Btree ]         [ Write to ]              [ backing-device ]
>  [ Invalidation ]    [ backing-device]                    |
>          |                   |                            v
>          +---------+---------+                        [ Btree ]
>                    |                              [ Invalidation ]
>                    v                                      |
>              [ Write End ]                                v
>                                                     [ Write End ]
>
> Enforcing this sequential execution ensures that one of the following holds:
> 1. A read that caches stale data is followed by the deferred write
>    invalidation, which removes the stale key.
> 2. The write invalidation executes first, so the read path's key
>    replacement detects the collision and does not cache the stale data.
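The two outcomes can be checked against a toy model of a single cached block. This is only a sketch of the ordering argument, not bcache code: every function and variable name below is made up for illustration, and each step is assumed to be atomic.

```shell
#!/usr/bin/env bash
# Toy model of one cached block (illustrative names, not bcache code).
# State: backing-device contents, btree key state (none|placeholder|valid),
# the cached data, and the data the read path fetched.
init_state() { backing=old; key=none; cached=; fetched=; }

# Read path, split into its three steps.
read_insert()  { key=placeholder; }      # cache miss: insert placeholder key
read_fetch()   { fetched=$backing; }     # read from the backing device
read_replace() {                         # succeeds only if the placeholder survived
    if [ "$key" = placeholder ]; then key=valid; cached=$fetched; fi
}

# Bypass write path, split into its two steps.
write_backing()    { backing=new; }
write_invalidate() { key=none; cached=; }

# A later read: served from the cache on a hit, else from the backing device.
later_read() {
    if [ "$key" = valid ]; then echo "$cached"; else echo "$backing"; fi
}

# run STEP...: execute one interleaving, then print what a later read returns.
run() {
    init_state
    local step
    for step in "$@"; do "$step"; done
    later_read
}

# Old flow: invalidation completes before the delayed backing write.
run write_invalidate read_insert read_fetch read_replace write_backing  # prints "old" (stale)
# New flow, case 1: stale data is cached, then the deferred invalidation removes it.
run read_insert read_fetch write_backing read_replace write_invalidate  # prints "new"
# New flow, case 2: invalidation lands first, so the key replacement fails.
run read_insert read_fetch write_backing write_invalidate read_replace  # prints "new"
```

In the new flow `write_invalidate` can never precede `write_backing`, so the first interleaving is no longer possible, and the remaining ones leave either no cached key or a key that is removed afterwards.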
>
> Failure Handling
> ----------------
> This patch keeps existing error-handling behavior intact. Although
> execution is now sequential, btree invalidation is still triggered
> regardless of whether the write to the backing device succeeds
> or fails.
>
> Verification and Performance
> ----------------------------
> Manual Results (With Patch):
> # cat second-read-result
> 0000000 0101 0101 0101 0101 0101 0101 0101 0101 # <--- CORRECT DATA
>
> Stress Verification:
> FIO was executed under a write-only workload (128 KB Write, libaio,
> iodepth=64, direct=1). Without the patch, FIO reported CRC errors
> due to stale read corruptions; with the patch, zero CRC errors or
> corruptions were reported.
>
> Write-Only Workload (FIO Averages):
>   Metric                    With Fix    Without Fix   Delta
>   Write IOPS                1630        1630          0.00%
>   Write Bandwidth (MiB/s)   204         204           0.00%
>   Write Avg Latency (us)    39219.95    39219.58      0.00%
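For reference, the Delta column can be recomputed from the two averages. The sketch below assumes Delta is the relative change of "With Fix" against "Without Fix", rounded to two decimals; the cover letter does not state the formula, and `pct_delta` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# pct_delta FIXED BASE: relative change of FIXED against BASE, in percent.
# Assumed formula: (FIXED - BASE) / BASE * 100, rounded to two decimals.
pct_delta() {
    awk -v fixed="$1" -v base="$2" \
        'BEGIN { printf "%.2f%%\n", (fixed - base) / base * 100 }'
}

pct_delta 39219.95 39219.58   # average write latency: prints "0.00%"
```

Under that assumption the 0.37 microsecond latency difference is roughly 0.001% of the baseline, which is why it rounds to 0.00%.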
>
> Test Environment
> ----------------
> - CPU: 1 vCPU, Intel Haswell x86_64 (n1-standard-1 instance)
> - Memory: 3.75 GB RAM
> - OS: Linux 6.12.68 (Google COS)
> - Storage: Google Cloud SSD PD + Local SSD
>
> Ankit Kapoor (1):
> bcache: fix stale data race between read cache miss and bypass write
>
> drivers/md/bcache/request.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> --
> 2.54.0.669.g59709faab0-goog