Re: [PATCH v4 05/11] block: Add core atomic write support

From: John Garry
Date: Mon Feb 26 2024 - 04:55:46 EST


On 25/02/2024 12:09, Ritesh Harjani (IBM) wrote:
John Garry <john.g.garry@xxxxxxxxxx> writes:

Add atomic write support as follows:
- report request_queue atomic write support limits to sysfs and update Doc
- add helper functions to get request_queue atomic write limits
- support to safely merge atomic writes
- add a per-request atomic write flag
- deal with splitting atomic writes
- misc helper functions

New sysfs files are added to report the following atomic write limits:
- atomic_write_boundary_bytes
- atomic_write_max_bytes
- atomic_write_unit_max_bytes
- atomic_write_unit_min_bytes

atomic_write_unit_{min,max}_bytes report the min and max atomic write
support size, inclusive, and are primarily dictated by HW capability. Both
values must be a power-of-2. atomic_write_boundary_bytes, if non-zero,
indicates an LBA space boundary; an atomic write which straddles it is no
longer executed atomically by the disk. atomic_write_max_bytes is the
maximum merged size for an atomic write. Often it will be the same value as
atomic_write_unit_max_bytes.
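
As an illustration only, here is a minimal userspace sketch of reading the four limits listed above. The /sys/block/<disk>/queue/ location and the helper names read_queue_attr() and print_atomic_write_limits() are assumptions made for this example; only the attribute names come from the patch description.

```c
#include <stdio.h>

/* Attribute names as listed in the patch description. */
static const char *const atomic_write_attrs[] = {
	"atomic_write_boundary_bytes",
	"atomic_write_max_bytes",
	"atomic_write_unit_max_bytes",
	"atomic_write_unit_min_bytes",
};

/*
 * Hypothetical helper: read one numeric queue attribute of @disk
 * (e.g. "sda"). Returns 0 and stores the value in *val on success,
 * -1 if the file is missing or unparsable.
 * ASSUMPTION: the attributes live under /sys/block/<disk>/queue/.
 */
static int read_queue_attr(const char *disk, const char *attr,
			   unsigned long long *val)
{
	char path[256];
	FILE *f;
	int n, ok;

	n = snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", disk, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = (fscanf(f, "%llu", val) == 1);
	fclose(f);

	return ok ? 0 : -1;
}

/* Hypothetical helper: print every atomic write limit @disk exposes. */
static void print_atomic_write_limits(const char *disk)
{
	size_t i;

	for (i = 0; i < sizeof(atomic_write_attrs) / sizeof(atomic_write_attrs[0]); i++) {
		unsigned long long val;

		if (read_queue_attr(disk, atomic_write_attrs[i], &val) == 0)
			printf("%s: %llu\n", atomic_write_attrs[i], val);
		else
			printf("%s: not available\n", atomic_write_attrs[i]);
	}
}
```

Calling print_atomic_write_limits("sda") on a kernel without this series would be expected to report every attribute as not available.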

Instead of explaining sysfs outputs which are derivatives of HW
and request_queue limits (and also defined in Documentation), maybe we
could explain how those sysfs values are derived instead -

struct queue_limits {
<...>
unsigned int atomic_write_hw_max_sectors;
unsigned int atomic_write_max_sectors;
unsigned int atomic_write_hw_boundary_sectors;
unsigned int atomic_write_hw_unit_min_sectors;
unsigned int atomic_write_unit_min_sectors;
unsigned int atomic_write_hw_unit_max_sectors;
unsigned int atomic_write_unit_max_sectors;
<...>

1. atomic_write_hw_max_sectors comes directly from hw and it need
not be a power of 2.

2. atomic_write_hw_unit_min_sectors and atomic_write_hw_unit_max_sectors
are again defined/derived from hw limits, but they are rounded down so that
they are always a power of 2.

3. atomic_write_hw_boundary_sectors again comes from HW boundary limit.
It could either be 0 (which means the device specifies no boundary limit) or a
multiple of unit_max. It need not be a power of 2; however, the current
code assumes it to be a power of 2 (check callers of blk_queue_atomic_write_boundary_bytes()).

4. atomic_write_max_sectors, atomic_write_unit_min_sectors
and atomic_write_unit_max_sectors are all derived out of above hw limits
inside function blk_atomic_writes_update_limits() based on request_queue
limits.
a. atomic_write_max_sectors is derived from atomic_write_hw_unit_max_sectors and
request_queue's max_hw_sectors limit. It also guarantees max
sectors that can be fit in a single bio.
b. atomic_write_unit_[min|max]_sectors are derived from atomic_write_hw_unit_[min|max]_sectors,
request_queue's max_hw_sectors & blk_queue_max_guaranteed_bio_sectors(). Both of these limits
are kept as a power of 2.
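
To make point 4b concrete, here is a rough userspace model of how the unit limits could be derived. It is a sketch and not the code of blk_atomic_writes_update_limits(): the struct, the helper names, and the choice of a plain minimum over power-of-2 rounded inputs are all assumptions, and the value of blk_queue_max_guaranteed_bio_sectors() is simply passed in as a parameter.

```c
/* Largest power of two <= v; returns 0 for v == 0. */
static unsigned int rounddown_pow2(unsigned int v)
{
	unsigned int p = 1;

	if (!v)
		return 0;
	while (p <= v / 2)
		p *= 2;
	return p;
}

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Hypothetical, trimmed-down stand-in for struct queue_limits. */
struct model_aw_limits {
	unsigned int max_hw_sectors;
	unsigned int atomic_write_hw_unit_min_sectors; /* power of 2 */
	unsigned int atomic_write_hw_unit_max_sectors; /* power of 2 */
	unsigned int atomic_write_unit_min_sectors;    /* derived */
	unsigned int atomic_write_unit_max_sectors;    /* derived */
};

/*
 * ASSUMPTION: "derived from" is modelled as a minimum. Because every
 * input to min_uint() is a power of 2, so are the results.
 */
static void model_update_unit_limits(struct model_aw_limits *l,
				     unsigned int max_guaranteed_bio_sectors)
{
	unsigned int unit_limit =
		min_uint(rounddown_pow2(l->max_hw_sectors),
			 rounddown_pow2(max_guaranteed_bio_sectors));

	l->atomic_write_unit_min_sectors =
		min_uint(l->atomic_write_hw_unit_min_sectors, unit_limit);
	l->atomic_write_unit_max_sectors =
		min_uint(l->atomic_write_hw_unit_max_sectors, unit_limit);
}
```

With max_hw_sectors = 2560, a guaranteed bio size of 1000 sectors and HW unit limits of 8 and 4096 sectors, this model yields unit_min = 8 and unit_max = 512.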

Now coming to sysfs outputs -
1. atomic_write_unit_max_bytes: Same as atomic_write_unit_max_sectors in bytes
2. atomic_write_unit_min_bytes: Same as atomic_write_unit_min_sectors in bytes
3. atomic_write_boundary_bytes: same as atomic_write_hw_boundary_sectors
in bytes
4. atomic_write_max_bytes: Same as atomic_write_max_sectors in bytes


ok, I can look to incorporate the advised formatting changes


atomic_write_unit_max_bytes is capped at the maximum data size which we are
guaranteed to be able to fit in a BIO, as an atomic write must always be
submitted as a single BIO. This BIO max size is dictated by the number of

Here it says that the atomic write must always be submitted as a single
bio. From where to where?

submitted to the block layer/core

I think you meant from FS to block layer.

sure, or also block device file operations (in fops.c) to block core

Because otherwise we still allow request/bio merging inside block layer
based on the request queue limits we defined above. i.e. bio can be
chained to form
rq->biotail->bi_next = next_rq->bio
as long as the merged request is within the queue_limits.

i.e. atomic write requests can be merged as long as -
- both rqs have REQ_ATOMIC set
- blk_rq_sectors(final_rq) <= q->limits.atomic_write_max_sectors
- final rq formed should not straddle limits->atomic_write_hw_boundary_sectors

However, splitting of an atomic write request is not allowed. And if it
happens, we fail the I/O req & return -EINVAL.
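
The three merge conditions above could be modelled roughly as follows. This is a userspace sketch with hypothetical types and names, not the blk-merge.c code; it only covers a back merge, adds a contiguity check that the list above does not mention, and assumes the boundary is zero or a power of 2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct request. */
struct model_rq {
	uint64_t pos;     /* start sector */
	uint32_t sectors; /* length in sectors */
	bool atomic;      /* stands in for REQ_ATOMIC */
};

/* Hypothetical stand-in for the relevant queue_limits members. */
struct model_limits {
	uint32_t atomic_write_max_sectors;
	uint32_t atomic_write_hw_boundary_sectors; /* 0 or a power of 2 */
};

/* May @next be appended to @rq while keeping the write atomic? */
static bool model_atomic_back_merge_ok(const struct model_limits *lim,
				       const struct model_rq *rq,
				       const struct model_rq *next)
{
	uint64_t total = (uint64_t)rq->sectors + next->sectors;

	/* Both rqs must have the atomic flag set. */
	if (!rq->atomic || !next->atomic)
		return false;

	/* ASSUMPTION: a back merge needs @next to start where @rq ends. */
	if (!total || rq->pos + rq->sectors != next->pos)
		return false;

	/* The merged request must fit in atomic_write_max_sectors. */
	if (total > lim->atomic_write_max_sectors)
		return false;

	/* First and last sector must fall within the same boundary window. */
	if (lim->atomic_write_hw_boundary_sectors) {
		uint64_t mask = lim->atomic_write_hw_boundary_sectors - 1;
		uint64_t last = rq->pos + total - 1;

		if ((rq->pos | mask) != (last | mask))
			return false;
	}

	return true;
}
```

For example, with a 16-sector boundary and atomic_write_max_sectors = 16, merging sectors 0-7 with 8-15 passes, while merging 8-15 with 16-23 fails because the result would straddle the boundary at sector 16.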

..


IMHO, the commit message can definitely use a re-write. I agree that you
have put in a lot of information, but I think it can be more organized.

ok, fine. I'll look at this. Thanks.



Contains significant contributions from:
Himanshu Madhani <himanshu.madhani@xxxxxxxxxx>

Maybe it can use a better tag then.
"Documentation/process/submitting-patches.rst"

ok



Signed-off-by: John Garry <john.g.garry@xxxxxxxxxx>
---
Documentation/ABI/stable/sysfs-block | 52 ++++++++++++++
block/blk-merge.c | 91 ++++++++++++++++++++++-
block/blk-settings.c | 103 +++++++++++++++++++++++++++
block/blk-sysfs.c | 33 +++++++++
block/blk.h | 3 +
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 60 ++++++++++++++++
7 files changed, 343 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 1fe9a553c37b..4c775f4bdefe 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -21,6 +21,58 @@ Description:
device is offset from the internal allocation unit's
natural alignment.

..


/* A comment explaining this function and arguments could be helpful */

already addressed according to earlier review


+static bool rq_straddles_atomic_write_boundary(struct request *rq,
+ unsigned int front,
+ unsigned int back)

Perhaps a better naming would be start_adjust, end_adjust?

ok


+{
+ unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
+ unsigned int mask, imask;
+ loff_t start, end;

start_rq_pos, end_rq_pos maybe?

ok


+
+ if (!boundary)
+ return false;
+
+ start = rq->__sector << SECTOR_SHIFT;

blk_rq_pos(rq) perhaps?

ok


+ end = start + rq->__data_len;

blk_rq_bytes(rq) perhaps? It should be..

ok

+
+ start -= front;
+ end += back;
+
+ /* We're longer than the boundary, so must be crossing it */
+ if (end - start > boundary)
+ return true;
+
+ mask = boundary - 1;
+
+ /* start/end are boundary-aligned, so cannot be crossing */
+ if (!(start & mask) || !(end & mask))
+ return false;
+
+ imask = ~mask;
+
+ /* Top bits are different, so crossed a boundary */
+ if ((start & imask) != (end & imask))
+ return true;

The last condition looks wrong. Shouldn't it be end - 1?

+
+ return false;
+}

Can we do something like this?

static bool rq_straddles_atomic_write_boundary(struct request *rq,
unsigned int start_adjust,
unsigned int end_adjust)
{
unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
unsigned long boundary_mask;
unsigned long start_rq_pos, end_rq_pos;

if (!boundary)
return false;

start_rq_pos = blk_rq_pos(rq) << SECTOR_SHIFT;
end_rq_pos = start_rq_pos + blk_rq_bytes(rq);

start_rq_pos -= start_adjust;
end_rq_pos += end_adjust;

boundary_mask = boundary - 1;

if ((start_rq_pos | boundary_mask) != (end_rq_pos | boundary_mask))
return true;

return false;
}

I was thinking this check should cover all cases? Thoughts?

that looks ok (apart from issue already detected later). It is quite similar to how I coded it in the NVMe driver, apart from the initial > boundary check.
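
For reference, the "end - 1" question raised above can be illustrated with a small userspace sketch. The function names are made up for the example, positions are plain byte offsets instead of a struct request, and the boundary is assumed to be zero or a power of 2.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Mask comparison using the exclusive end offset of [start, end).
 * A write that ends exactly on a boundary is reported as straddling,
 * which is a false positive.
 */
static bool straddles_exclusive_end(uint64_t start, uint64_t end,
				    uint64_t boundary)
{
	uint64_t mask;

	if (!boundary)
		return false;

	mask = boundary - 1;
	return (start | mask) != (end | mask);
}

/*
 * Same comparison, but against the last byte actually written
 * (end - 1). First and last byte must share a boundary window.
 */
static bool straddles_last_byte(uint64_t start, uint64_t end,
				uint64_t boundary)
{
	uint64_t mask;

	if (!boundary || end <= start)
		return false;

	mask = boundary - 1;
	return (start | mask) != ((end - 1) | mask);
}
```

With a 4096-byte boundary, the range [0, 4096) is flagged by the first variant but not by the second, whereas [2048, 6144) is flagged by both. The second variant also flags any range longer than the boundary, such as [0, 8192), so no separate length check is needed in this model.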

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index f288c94374b3..cd7cceb8565d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -422,6 +422,7 @@ enum req_flag_bits {
__REQ_DRV, /* for driver use */
__REQ_FS_PRIVATE, /* for file system (submitter) use */

+ __REQ_ATOMIC, /* for atomic write operations */
/*
* Command specific flags, keep last:
*/
@@ -448,6 +449,7 @@ enum req_flag_bits {
#define REQ_RAHEAD (__force blk_opf_t)(1ULL << __REQ_RAHEAD)
#define REQ_BACKGROUND (__force blk_opf_t)(1ULL << __REQ_BACKGROUND)
#define REQ_NOWAIT (__force blk_opf_t)(1ULL << __REQ_NOWAIT)
+#define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC)

Let's add this in the same order as of __REQ_ATOMIC i.e. after
REQ_FS_PRIVATE macro

ok, fine

@@ -299,6 +299,14 @@ struct queue_limits {
unsigned int discard_alignment;
unsigned int zone_write_granularity;

+ unsigned int atomic_write_hw_max_sectors;
+ unsigned int atomic_write_max_sectors;
+ unsigned int atomic_write_hw_boundary_sectors;
+ unsigned int atomic_write_hw_unit_min_sectors;
+ unsigned int atomic_write_unit_min_sectors;
+ unsigned int atomic_write_hw_unit_max_sectors;
+ unsigned int atomic_write_unit_max_sectors;
+

1 liner comment for above members please?

ok


+static inline bool bdev_can_atomic_write(struct block_device *bdev)
+{
+ struct request_queue *bd_queue = bdev->bd_queue;
+ struct queue_limits *limits = &bd_queue->limits;
+
+ if (!limits->atomic_write_unit_min_sectors)
+ return false;
+
+ if (bdev_is_partition(bdev)) {
+ sector_t bd_start_sect = bdev->bd_start_sect;
+ unsigned int granularity = max(

atomic_align perhaps?

or just "align"


+ limits->atomic_write_unit_min_sectors,
+ limits->atomic_write_hw_boundary_sectors);
+ if (do_div(bd_start_sect, granularity))
+ return false;
+ }

since atomic_align is a power of 2. Why not use IS_ALIGNED()?
(bitwise operation instead of div)?

already changed as advised
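
As a side note on the IS_ALIGNED() suggestion, here is a userspace sketch of why the bitwise test can replace the division only when the alignment is a power of 2. The helpers below are illustrative stand-ins, not the kernel macro, and model_partition_can_atomic_write() is a hypothetical simplification of the partition check quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bitwise alignment test; only meaningful when @a is a power of 2. */
static bool is_aligned_pow2(uint64_t x, uint64_t a)
{
	return (x & (a - 1)) == 0;
}

/* Division-based test; valid for any non-zero @a. */
static bool is_aligned_div(uint64_t x, uint64_t a)
{
	return x % a == 0;
}

/*
 * Hypothetical model of the partition check: the partition start must
 * be aligned to the larger of unit_min and the boundary.
 * ASSUMPTION: both values are powers of 2 (the boundary may be 0).
 */
static bool model_partition_can_atomic_write(uint64_t start_sect,
					     unsigned int unit_min_sectors,
					     unsigned int boundary_sectors)
{
	unsigned int align = unit_min_sectors > boundary_sectors ?
			     unit_min_sectors : boundary_sectors;

	if (!unit_min_sectors)
		return false;

	return is_aligned_pow2(start_sect, align);
}
```

Both helpers agree for every offset when the alignment is 8, but for a non-power-of-2 alignment such as 6 the bitwise test wrongly treats offset 8 as aligned, so the simplification relies on the power-of-2 guarantee.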

Thanks,
John