Re: [PATCH] md/raid5: fix race between reshape and chunk-aligned read

From: FengWei Shih

Date: Tue Apr 14 2026 - 04:27:48 EST


Hi Nan,

On 2026/4/13 03:19 PM, Li Nan wrote:


On 2026/4/9 13:17, FengWei Shih wrote:
raid5_make_request() checks mddev->reshape_position to decide whether
to allow chunk-aligned reads. However, in raid5_start_reshape() the
layout configuration (raid_disks, algorithm, etc.) is updated before
mddev->reshape_position is set:

   reshape (raid5_start_reshape)        read (raid5_make_request)
   =============================        =========================
   write_seqcount_begin
   update raid_disks, algorithm...
   set conf->reshape_progress
   write_seqcount_end
                                        check mddev->reshape_position
                                          * still MaxSector, allow
                                        raid5_read_one_chunk()
                                          * use new layout
   raid5_quiesce()
   set mddev->reshape_position

Since reshape_position has not yet been updated, raid5_make_request()
concludes that no reshape is in progress and proceeds down the
chunk-aligned path, but the layout has already changed, so
raid5_compute_sector() returns an incorrect physical address.

Fix this by reading conf->reshape_progress under gen_lock in
raid5_read_one_chunk() and falling back to the stripe path if a
reshape is in progress.

Signed-off-by: FengWei Shih <dannyshih@xxxxxxxxxxxx>
---
  drivers/md/raid5.c | 8 ++++++++
  1 file changed, 8 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a8e8d431071b..bded2b86f0ef 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5421,6 +5421,11 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 	sector_t sector, end_sector;
 	int dd_idx;
 	bool did_inc;
+	int seq;
+
+	seq = read_seqcount_begin(&conf->gen_lock);
+	if (unlikely(conf->reshape_progress != MaxSector))
+		return 0;
 
 	if (!in_chunk_boundary(mddev, raid_bio)) {
 		pr_debug("%s: non aligned\n", __func__);
@@ -5431,6 +5436,9 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 			      &dd_idx, NULL);
 	end_sector = sector + bio_sectors(raid_bio);
 
+	if (read_seqcount_retry(&conf->gen_lock, seq))
+		return 0;
+
 	if (r5c_big_stripe_cached(conf, sector))
 		return 0;

It seems that there might be race issues wherever raid5_compute_sector is
used? This fix only addresses one of the problems.

Thanks for the review. You are right that this race pattern affects
more than just raid5_read_one_chunk(). I checked the callers of
raid5_compute_*() and some lockless reshape_progress / reshape_position
checks:

Already safe:

- make_stripe_request(): already uses gen_lock seqcount properly.

- init_stripe() / stripe_set_idx(): init_stripe() is under gen_lock;
  stripe_set_idx() is only used for the dst stripe in handle_stripe().

- handle_stripe_expansion() / reshape_request(): these run as part of
  the reshape itself, so using the new layout is intentional.

- raid5-cache.c / raid5-ppl.c: journal/PPL are not allowed with
  reshape, so no race.

- raid5_bio_lowest_chunk_sector(): no lock protection, but the return
  value is bounded within the bio's stripe range, so the worst case
  is suboptimal I/O ordering, not data corruption or lost I/O.

Needs fix:

- raid5_read_one_chunk(): fixed by this patch.

- retry_aligned_read(): has a similar issue. I will fix it with
  gen_lock seqcount in the next version of this patch.

- raid5_bitmap_sector(): if the check sees LOC_NO_RESHAPE but reshape
  starts and passes this region before the stripe is processed, the
  bitmap position will not match the layout used for the write.

- make_discard_request(): checks mddev->reshape_position and computes
  logical sectors from the layout fields, all without any lock.

- raid5_make_request(): the lockless reshape_progress check that decides
  on_wq can race with reshape start.

I think there are two possible solutions:

1. All callers should check reshape within proper locking
   (conf->device_lock or conf->gen_lock).

2. Suspend I/O during start_reshape via mddev_suspend() /
   mddev_resume(), so readers do not need to worry about seeing an
   inconsistent state.

I am going with direction 1 for now since gen_lock seems to be
designed for exactly this kind of race.
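To make direction 1 concrete, the shape I have in mind for each of the
remaining lockless checks is the usual seqcount read loop (untested
sketch; which layout fields get sampled depends on the caller):

```c
	unsigned int seq;
	bool reshaping;

	do {
		seq = read_seqcount_begin(&conf->gen_lock);
		reshaping = conf->reshape_progress != MaxSector;
		/* ... sample whatever layout fields the caller needs ... */
	} while (read_seqcount_retry(&conf->gen_lock, seq));
```

A caller that cannot retry (for example, one that has already committed
to the sampled layout) would instead bail out to the stripe path, as
raid5_read_one_chunk() does in this patch.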
