Re: [PATCH v5 2/2] md: allow configuring logical block size

From: Li Nan
Date: Thu Sep 25 2025 - 04:34:37 EST

Next message: Madhur Kumar: "[PATCH v2 1/2] selftests/acct: add cleanup for leftover process_log binary"
Previous message: Fedor Pchelkin: "Re: [lvc-project] [PATCH] gpu: i915: fix error return in mmap_offset_attach()"
Next in thread: Li Nan: "Re: [PATCH v5 2/2] md: allow configuring logical block size"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2025/9/23 22:06, Xiao Ni wrote:

On Tue, Sep 23, 2025 at 9:37 PM Li Nan <linan666@xxxxxxxxxxxxxxx> wrote:

On 2025/9/23 19:36, Xiao Ni wrote:

Hi Li Nan

On Thu, Sep 18, 2025 at 8:08 PM <linan666@xxxxxxxxxxxxxxx> wrote:

From: Li Nan <linan122@xxxxxxxxxx>

Previously, a RAID array used the maximum logical block size (LBS)
of all member disks. Adding a disk with a larger LBS at runtime could
unexpectedly increase the array's LBS, risking corruption of existing
partitions. This can be reproduced by:

```
# LBS of sd[de] is 512 bytes, sdf is 4096 bytes.
mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean

# LBS is 512
cat /sys/block/md0/queue/logical_block_size

# create partition md0p1
parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100%
lsblk | grep md0p1

# LBS becomes 4096 after adding sdf
mdadm --add -q /dev/md0 /dev/sdf
cat /sys/block/md0/queue/logical_block_size

# partition lost
partprobe /dev/md0
lsblk | grep md0p1
```
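
The partition loss in the reproducer follows from how GPT is addressed: the primary header lives at LBA 1 and partition boundaries are stored as LBAs, so both depend on the LBS in effect when the table is read. A minimal userspace sketch of that arithmetic (the helper names are hypothetical and not part of the patch):

```c
#include <stdint.h>

/* GPT addresses everything in units of the logical block size (LBS). */

/* Byte offset of a given LBA on a device with the given LBS. */
static uint64_t lba_to_bytes(uint64_t lba, uint32_t lbs)
{
	return lba * (uint64_t)lbs;
}

/* LBA that a byte offset maps to on a device with the given LBS. */
static uint64_t bytes_to_lba(uint64_t bytes, uint32_t lbs)
{
	return bytes / lbs;
}
```

With an LBS of 512 the header is written at byte offset 512 and a partition starting at 1MiB is recorded as LBA 2048. Once the LBS becomes 4096, a reader looks for the header at byte offset 4096 instead, and LBA 2048 would be interpreted as 8MiB.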

Thanks for the reproducer. I can reproduce it myself.

Simply restricting larger-LBS disks is inflexible. In some scenarios,
only disks with 512 bytes LBS are available currently, but later, disks
with 4KB LBS may be added to the array.

If we add a disk with 4KB LBS and set the array's LBS to 4KB through the
sysfs interface, how can we keep the partition table readable and avoid
the problem mentioned above?

Hi

Thanks for your review.

The main cause of partition loss is LBS changes. Therefore, we should
specify a 4K LBS at creation time, instead of modifying LBS after the RAID
is already in use. For example:

mdadm -C --logical-block-size=4096 ...

In this way, even if all underlying disks are 512-byte, the RAID will be
created with a 4096 LBS. Adding 4096-byte disks later will not cause any
issues.
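
To illustrate why choosing 4096 at creation time keeps the LBS stable, here is a rough userspace sketch of the rule stated in the patch's md.rst hunk (final LBS is the max of the configured value and every rdev's LBS, and must not exceed PAGE_SIZE). The function name, the `SKETCH_PAGE_SIZE` constant, and the "return 0 on overflow" convention are assumptions made for this example only:

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's PAGE_SIZE; 4096 is assumed here. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Effective array LBS: the larger of the configured value and every
 * member's LBS. Returns 0 if the result would exceed SKETCH_PAGE_SIZE.
 */
static unsigned int effective_lbs(unsigned int configured,
				  const unsigned int *member_lbs, size_t n)
{
	unsigned int lbs = configured;
	size_t i;

	for (i = 0; i < n; i++)
		if (member_lbs[i] > lbs)
			lbs = member_lbs[i];

	return lbs > SKETCH_PAGE_SIZE ? 0 : lbs;
}
```

With a configured value of 4096, members of {512, 512} and {512, 512, 4096} both yield 4096, so adding the 4K disk changes nothing. With a configured value of 512, the same addition moves the result from 512 to 4096, which is the case the reproducer hits.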

That can work, though it looks strange to me to set the LBS to 4096 when
every device's LBS is 512 bytes. I don't object to it anyway :)

Making LBS configurable is the best way to solve this scenario.
After this patch, the raid will:
- store LBS in disk metadata
- add a read-write sysfs 'mdX/logical_block_size'

Future mdadm should support setting LBS via metadata field during RAID
creation and the new sysfs. Though the kernel allows runtime LBS changes,
users should avoid modifying it after creating partitions or filesystems
to prevent compatibility issues.

Only 1.x metadata supports configurable LBS. 0.90 metadata inits all
fields to default values at auto-detect. Supporting 0.90 would require
more extensive changes and no such use case has been observed.

Note that many RAID paths rely on PAGE_SIZE alignment, including for
metadata I/O. A larger LBS than PAGE_SIZE will result in metadata
read/write failures. So this config should be prevented.

Signed-off-by: Li Nan <linan122@xxxxxxxxxx>
---
Documentation/admin-guide/md.rst | 7 +++
drivers/md/md.h | 1 +
include/uapi/linux/raid/md_p.h | 3 +-
drivers/md/md-linear.c | 1 +
drivers/md/md.c | 75 ++++++++++++++++++++++++++++++++
drivers/md/raid0.c | 1 +
drivers/md/raid1.c | 1 +
drivers/md/raid10.c | 1 +
drivers/md/raid5.c | 1 +
9 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 1c2eacc94758..f5c81fad034a 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -238,6 +238,13 @@ All md devices contain:
the number of devices in a raid4/5/6, or to support external
metadata formats which mandate such clipping.

+ logical_block_size
+ Configures the array's logical block size in bytes. This attribute
+ is only supported for RAID1, RAID5, RAID10 with 1.x meta. The value

s/RAID5/RAID456/g

I will fix it later. Thanks.

+ should be written before starting the array. The final array LBS
+ will use the max value between this configuration and all rdev's LBS.
+ Note that LBS cannot exceed PAGE_SIZE.
+
reshape_position
This is either ``none`` or a sector number within the devices of
the array where ``reshape`` is up to. If this is set, the three
diff --git a/drivers/md/md.h b/drivers/md/md.h
index afb25f727409..b0147b98c8d3 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -432,6 +432,7 @@ struct mddev {
sector_t array_sectors; /* exported array size */
int external_size; /* size managed
* externally */
+ unsigned int logical_block_size;
__u64 events;
/* If the last 'event' was simply a clean->dirty transition, and
* we didn't write it to the spares, then it is safe and simple
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index ac74133a4768..310068bb2a1d 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -291,7 +291,8 @@ struct mdp_superblock_1 {
__le64 resync_offset; /* data before this offset (from data_offset) known to be in sync */
__le32 sb_csum; /* checksum up to devs[max_dev] */
__le32 max_dev; /* size of devs[] array to consider */
- __u8 pad3[64-32]; /* set to 0 when writing */
+ __le32 logical_block_size; /* same as q->limits->logical_block_size */
+ __u8 pad3[64-36]; /* set to 0 when writing */

/* device state information. Indexed by dev_number.
* 2 bytes per device
diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index 5d9b08115375..da8babb8da59 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -72,6 +72,7 @@ static int linear_set_limits(struct mddev *mddev)

md_init_stacking_limits(&lim);
lim.max_hw_sectors = mddev->chunk_sectors;
+ lim.logical_block_size = mddev->logical_block_size;
lim.max_write_zeroes_sectors = mddev->chunk_sectors;
lim.io_min = mddev->chunk_sectors << 9;
err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 40f56183c744..e0184942c8ec 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1963,6 +1963,7 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *freshest, struc
mddev->layout = le32_to_cpu(sb->layout);
mddev->raid_disks = le32_to_cpu(sb->raid_disks);
mddev->dev_sectors = le64_to_cpu(sb->size);
+ mddev->logical_block_size = le32_to_cpu(sb->logical_block_size);
mddev->events = ev1;
mddev->bitmap_info.offset = 0;
mddev->bitmap_info.space = 0;
@@ -2172,6 +2173,7 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
sb->chunksize = cpu_to_le32(mddev->chunk_sectors);
sb->level = cpu_to_le32(mddev->level);
sb->layout = cpu_to_le32(mddev->layout);
+ sb->logical_block_size = cpu_to_le32(mddev->logical_block_size);
if (test_bit(FailFast, &rdev->flags))
sb->devflags |= FailFast1;
else
@@ -5900,6 +5902,66 @@ static struct md_sysfs_entry md_serialize_policy =
__ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
serialize_policy_store);

+static int mddev_set_logical_block_size(struct mddev *mddev,
+ unsigned int lbs)
+{
+ int err = 0;
+ struct queue_limits lim;
+
+ if (queue_logical_block_size(mddev->gendisk->queue) >= lbs) {
+ pr_err("%s: incompatible logical_block_size %u, can not set\n",
+ mdname(mddev), lbs);

Would it be better to print the mddev's current LBS and give a message
like "can't set LBS smaller than the mddev logical block size"?

I agree. Let me improve this.
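
One possible shape for the improved message, sketched in userspace C. It keeps the comparison from the patch (a request is rejected when the current LBS is greater than or equal to it) and only changes the text so that both values are reported. The function name and wording are hypothetical, and the message actually used in v6 may differ:

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for the check in
 * mddev_set_logical_block_size(): same comparison as the patch
 * (current >= requested is rejected), but the message names both values.
 */
static int check_new_lbs(const char *name, unsigned int cur, unsigned int req,
			 char *msg, size_t msg_len)
{
	if (cur >= req) {
		snprintf(msg, msg_len,
			 "%s: cannot set logical_block_size %u, must be larger than current %u",
			 name, req, cur);
		return -EINVAL;
	}

	/* Accepted: leave an empty message behind. */
	if (msg_len)
		msg[0] = '\0';
	return 0;
}
```
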

+ return -EINVAL;
+ }
+
+ lim = queue_limits_start_update(mddev->gendisk->queue);
+ lim.logical_block_size = lbs;
+ pr_info("%s: logical_block_size is changed, data may be lost\n",
+ mdname(mddev));
+ err = queue_limits_commit_update(mddev->gendisk->queue, &lim);
+ if (err)
+ return err;
+
+ mddev->logical_block_size = lbs;
+ return 0;
+}
+
+static ssize_t
+lbs_show(struct mddev *mddev, char *page)
+{
+ return sprintf(page, "%u\n", mddev->logical_block_size);
+}
+
+static ssize_t
+lbs_store(struct mddev *mddev, const char *buf, size_t len)
+{
+ unsigned int lbs;
+ int err = -EBUSY;
+
+ /* Only 1.x meta supports configurable LBS */
+ if (mddev->major_version == 0)
+ return -EINVAL;

It looks like it should check raid level here as doc mentioned above, right?

Yeah, Kuai suggested supporting this feature only with 1.x metadata.

I mean it should check whether the array is raid0 here, right? As the doc
says, it should return an error if the array is level 0.

Regards
Xiao

Apologies, I misunderstood. I will add the check in v6.
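
For reference, the check being discussed could look roughly like the userspace sketch below. It combines the existing 1.x metadata test from lbs_store() with the level list from the md.rst hunk (RAID1, RAID4/5/6, RAID10). The struct and function names are hypothetical, and the check that lands in v6 may be written differently:

```c
#include <errno.h>

/* Hypothetical, trimmed-down view of the fields the check would look at. */
struct sketch_mddev {
	int major_version;	/* 0 for 0.90 metadata, 1 for 1.x */
	int level;		/* 0, 1, 4, 5, 6, 10, ... */
};

/* Returns 0 if the array may take a configured LBS, -EINVAL otherwise. */
static int lbs_config_allowed(const struct sketch_mddev *mddev)
{
	/* Only 1.x meta supports configurable LBS (as in the patch). */
	if (mddev->major_version == 0)
		return -EINVAL;

	/* Levels named in the md.rst hunk, with RAID5 widened to RAID4/5/6. */
	switch (mddev->level) {
	case 1:
	case 4:
	case 5:
	case 6:
	case 10:
		return 0;
	default:
		return -EINVAL;
	}
}
```
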

--
Thanks,
Nan

