Re: [RFC PATCH v2 1/8] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features
From: Zhang Yi
Date: Fri Feb 07 2025 - 07:23:19 EST
On 2025/1/29 0:46, John Garry wrote:
> On 15/01/2025 11:46, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> Currently, it's hard to know whether the storage device supports unmap
>> write zeroes. We cannot determine it only by checking if the disk
>> supports the write zeroes command, as for some HDDs that do submit
>> actual zeros to the disk media even if they claim to support the write
>> zeroes command, but that should be very slow.
>
> This second sentence is too long, such that your meaning is hard to understand.
>
>>
>> Therefor, add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP and
>
> Therefore?
>
>> the corresponding sysfs entry, to indicate whether the block device
>> explicitly supports the unmapped write zeroes command. Each device
>> driver should set this bit if it is certain that the attached disk
>> supports this command.
>
> How can they be certain? You already wrote that some claim to support it, yet don't really. Well, I think that is what you meant.
>
Hi, John. thanks for your reply!
Sorry for the late and not make it clear enough earlier. Currently, there
are four situations of write zeroes command (aka REQ_OP_WRITE_ZEROES)
supported by various disks and backend storage devices.
A. Devices that do not support the write zeroes command
These devices have bdev_limits(bdev)->max_write_zeroes_sectors set to
zero.
B. Devices that support the write zeroes command
These devices have bdev_limits(bdev)->max_write_zeroes_sectors set to a
non-zero value. They can be further categorized into three
sub-situations:
B.1. Devices that write physical zeroes to the media
These devices perform the write zeroes operation by physically writing
zeroes to the storage media, which can be very slow (e.g., HDDs).
B.2. Devices that support unmap write zeroes
These devices can offload the write zeroes operation by unmapping the
logical blocks, effectively putting them into a deallocated state
(e.g., SSDs). This operation is typically very fast, allowing
filesystems to use this command to quickly create zeroed files. NVMe
and SCSI disk drivers already support this and can query the attached
disks to determine whether they support unmap write zeroes (please see
patches 2 and 3 for details).
B.3. The implementation of write zeroes on disks are unknown
This category includes non-standard disks and some network storage
devices where the exact implementation of the write zeroes command is
unclear.
Currently, users can only distinguish A and B through querying
/sys/block/<disk>/queue/write_zeroes_unmap
However, we cannot differentiate between the three sub-situations within B.
At the very least, we should aim to clearly distinguish between B.1 and B.2
on the standard HDDs and SSDs. After this series, we should set
BLK_FEAT_WRITE_ZEROES_UNMAP for devices in B.2, while leaving it unset for
the other three cases.
>> If the bit is not set, the disk either does not
>> support it, or its support status is unknown.
>>
>> For the stacked devices cases, the BLK_FEAT_WRITE_ZEROES_UNMAP should be
>> supported both by the stacking driver and all underlying devices.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>> ---
>> Documentation/ABI/stable/sysfs-block | 14 ++++++++++++++
>> block/blk-settings.c | 6 ++++++
>> block/blk-sysfs.c | 3 +++
>> include/linux/blkdev.h | 3 +++
>> 4 files changed, 26 insertions(+)
>>
>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>> index 0cceb2badc83..ab4117cefd9a 100644
>> --- a/Documentation/ABI/stable/sysfs-block
>> +++ b/Documentation/ABI/stable/sysfs-block
>> @@ -722,6 +722,20 @@ Description:
>> 0, write zeroes is not supported by the device.
>> +What: /sys/block/<disk>/queue/write_zeroes_unmap
>> +Date: January 2025
>> +Contact: Zhang Yi <yi.zhang@xxxxxxxxxx>
>> +Description:
>> + [RO] Devices that explicitly support the unmap write zeroes
>> + operation in which a single write zeroes request with the unmap
>> + bit set to zero out the range of contiguous blocks on storage
>
> which bit are you referring to?
Different disks and drivers have different control bit. For NVMe devices,
the control bit is NVME_WZ_DEAC, while for SCSI SSDs, it is set via
"cmd->cmnd[1] = 0x8", please see nvme_setup_write_zeroes() and
sd_setup_write_same[16|10]_cmnd() for details.
>
>> + by freeing blocks, rather than writing physical zeroes to the
>> + media. If write_zeroes_unmap is 1, this indicates that the
>> + device explicitly supports the write zero command. Otherwise,
>> + the device either does not support it, or its support status is
>> + unknown.
>
> I am struggling to understand the full meaning of a value of '0'.
>
> Does it means that either:
> a. it does not support write zero
> b. it does support write zero, yet just did not set BLK_FEAT_WRITE_ZEROES_UNMAP
>
Yes, if it is set to 0, it should be the cases A, B.1 and B.3 I
mentioned above.
>> +
>> +
>> What: /sys/block/<disk>/queue/zone_append_max_bytes
>> Date: May 2020
>> Contact: linux-block@xxxxxxxxxxxxxxx
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 8f09e33f41f6..a8bf2f8f0634 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -652,6 +652,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>> t->features &= ~BLK_FEAT_NOWAIT;
>> if (!(b->features & BLK_FEAT_POLL))
>> t->features &= ~BLK_FEAT_POLL;
>> + if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
>> + t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
>
> Why not just set this in BLK_FEAT_INHERIT_MASK? It's seems like a sensible thing to do...
>
No, we can't inherit BLK_FEAT_WRITE_ZEROES_UNMAP if any of the stacking
driver and underlying device not support unmap write zeroes. The
stacking driver should set it by default, and drop it if one unsupported
device attached, please see patch 4 for the device-mapper case.
>> t->flags |= (b->flags & BLK_FLAG_MISALIGNED);
>> @@ -774,6 +776,10 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>> t->zone_write_granularity = 0;
>> t->max_zone_append_sectors = 0;
>> }
>> +
>> + if (!t->max_write_zeroes_sectors)
>> + t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
>> +
>> blk_stack_atomic_writes_limits(t, b);
>> return ret;
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 767598e719ab..13f22bee19d2 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -248,6 +248,7 @@ static ssize_t queue_##_name##_show(struct gendisk *disk, char *page) \
>> QUEUE_SYSFS_FEATURE_SHOW(poll, BLK_FEAT_POLL);
>> QUEUE_SYSFS_FEATURE_SHOW(fua, BLK_FEAT_FUA);
>> QUEUE_SYSFS_FEATURE_SHOW(dax, BLK_FEAT_DAX);
>> +QUEUE_SYSFS_FEATURE_SHOW(write_zeroes_unmap, BLK_FEAT_WRITE_ZEROES_UNMAP);
>> static ssize_t queue_zoned_show(struct gendisk *disk, char *page)
>> {
>> @@ -468,6 +469,7 @@ QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
>> QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
>> QUEUE_RO_ENTRY(queue_max_write_zeroes_sectors, "write_zeroes_max_bytes");
>> +QUEUE_RO_ENTRY(queue_write_zeroes_unmap, "write_zeroes_unmap");
>> QUEUE_RO_ENTRY(queue_max_zone_append_sectors, "zone_append_max_bytes");
>> QUEUE_RO_ENTRY(queue_zone_write_granularity, "zone_write_granularity");
>> @@ -615,6 +617,7 @@ static struct attribute *queue_attrs[] = {
>> &queue_poll_delay_entry.attr,
>> &queue_virt_boundary_mask_entry.attr,
>> &queue_dma_alignment_entry.attr,
>> + &queue_write_zeroes_unmap_entry.attr,
>> NULL,
>> };
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 378d3a1a22fc..14ba1e2709bb 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -335,6 +335,9 @@ typedef unsigned int __bitwise blk_features_t;
>> #define BLK_FEAT_ATOMIC_WRITES_STACKED \
>> ((__force blk_features_t)(1u << 16))
>> +/* supports unmap write zeroes command */
>> +#define BLK_FEAT_WRITE_ZEROES_UNMAP ((__force blk_features_t)(1u << 17))
>
> Is this flag ever checked within the kernel?
>
> If not, I assume your idea is that the user checks this flag via sysfs for the block device which the fs is mounted on just to know if FALLOC_FL_WRITE_ZEROES is definitely fast, right?
You are right.
Thanks,
Yi.