Re: [PATCH v4 05/11] block: Add core atomic write support

From: IBM
Date: Sun Feb 25 2024 - 07:10:02 EST


John Garry <john.g.garry@xxxxxxxxxx> writes:

> Add atomic write support as follows:
> - report request_queue atomic write support limits to sysfs and udpate Doc
> - add helper functions to get request_queue atomic write limits
> - support to safely merge atomic writes
> - add a per-request atomic write flag
> - deal with splitting atomic writes
> - misc helper functions
>
> New sysfs files are added to report the following atomic write limits:
> - atomic_write_boundary_bytes
> - atomic_write_max_bytes
> - atomic_write_unit_max_bytes
> - atomic_write_unit_min_bytes
>
> atomic_write_unit_{min,max}_bytes report the min and max atomic write
> support size, inclusive, and are primarily dictated by HW capability. Both
> values must be a power-of-2. atomic_write_boundary_bytes, if non-zero,
> indicates an LBA space boundary at which an atomic write straddles no
> longer is atomically executed by the disk. atomic_write_max_bytes is the
> maximum merged size for an atomic write. Often it will be the same value as
> atomic_write_unit_max_bytes.

Instead of explaining sysfs outputs which are deriviatives of HW
and request_queue limits (and also defined in Documentation), maybe we
could explain how those sysfs values are derived instead -

struct queue_limits {
<...>
unsigned int atomic_write_hw_max_sectors;
unsigned int atomic_write_max_sectors;
unsigned int atomic_write_hw_boundary_sectors;
unsigned int atomic_write_hw_unit_min_sectors;
unsigned int atomic_write_unit_min_sectors;
unsigned int atomic_write_hw_unit_max_sectors;
unsigned int atomic_write_unit_max_sectors;
<...>

1. atomic_write_unit_hw_max_sectors comes directly from hw and it need
not be a power of 2.

2. atomic_write_hw_unit_min_sectors and atomic_write_hw_unit_max_sectors
is again defined/derived from hw limits, but it is rounded down so that
it is always a power of 2.

3. atomic_write_hw_boundary_sectors again comes from HW boundary limit.
It could either be 0 (which means the device specify no boundary limit) or a
multiple of unit_max. It need not be power of 2, however the current
code assumes it to be a power of 2 (check callers of blk_queue_atomic_write_boundary_bytes())

4. atomic_write_max_sectors, atomic_write_unit_min_sectors
and atomic_write_unit_max_sectors are all derived out of above hw limits
inside function blk_atomic_writes_update_limits() based on request_queue
limits.
a. atomic_write_max_sectors is derived from atomic_write_hw_unit_max_sectors and
request_queue's max_hw_sectors limit. It also guarantees max
sectors that can be fit in a single bio.
b. atomic_write_unit_[min|max]_sectors are derived from atomic_write_hw_unit_[min|max]_sectors,
request_queue's max_hw_sectors & blk_queue_max_guaranteed_bio_sectors(). Both of these limits
are kept as a power of 2.

Now coming to sysfs outputs -
1. atomic_write_unit_max_bytes: Same as atomic_write_unix_max_sectors in bytes
2. atomic_write_unit_min_bytes: Same as atomic_write_unit_min_sectors in bytes
3. atomic_write_boundary_bytes: same as atomic_write_hw_boundary_sectors
in bytes
4. atomic_write_max_bytes: Same as atomic_write_max_sectors in bytes

>
> atomic_write_unit_max_bytes is capped at the maximum data size which we are
> guaranteed to be able to fit in a BIO, as an atomic write must always be
> submitted as a single BIO. This BIO max size is dictated by the number of

Here it says that the atomic write must always be submitted as a single
bio. From where to where? I think you meant from FS to block layer.
Because otherwise we still allow request/bio merging inside block layer
based on the request queue limits we defined above. i.e. bio can be
chained to form
rq->biotail->bi_next = next_rq->bio
as long as the merged requests is within the queue_limits.

i.e. atomic write requests can be merged as long as -
- both rqs have REQ_ATOMIC set
- blk_rq_sectors(final_rq) <= q->limits.atomic_write_max_sectors
- final rq formed should not straddle limits->atomic_write_hw_boundary_sectors

However, splitting of an atomic write requests is not allowed. And if it
happens, we fail the I/O req & return -EINVAL.

> segments allowed which the request queue can support and the number of
> bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace issuing a
> write with iovcnt=1 for IOV_ITER - as such, we can rely on each segment
> containing PAGE_SIZE of data, apart from the first+last, which each can
> fit logical block size of data. Note that here we rely on the direct IO
> rule for alignment, that each iovec needs to be logical block size
> aligned/length multiple. Atomic writes may be supported for buffered IO in
> future, but it would still make sense to apply that direct IO rule there.
>
> atomic_write_max_sectors is capped at max_hw_sectors, but is not also
> capped at max_sectors. The value in max_sectors can be controlled from
> userspace, and it would only cause trouble if userspace could limit
> atomic_write_unit_max_bytes and the other atomic write limits.
>
> Atomic writes may be merged under the following conditions:
> - total request length <= atomic_write_max_bytes
> - the merged write does not straddle a boundary, if any
>
> It is only permissible to merge an atomic writes with another atomic
> write, i.e. it is not possible to merge an atomic and non-atomic write.
> There are many reasons for this, like:
> - SCSI has a dedicated atomic write command, so a merged atomic and
> non-atomic needs to be issued as an atomic write, putting an unnecessary
> burden on the disk to issue the merged write atomically
> - Dimensions of the merged non-atomic write need to be checked for size/
> offset to conform to atomic write rules, which adds overhead
> - Typically only atomic writes or non-atomic writes are expected for a
> file during normal processing, so not any expected use-case to cater for.
>
> Functions get_max_io_size() and blk_queue_get_max_sectors() are modified to
> handle atomic writes max length - those functions are used by the merge
> code.
>
> An atomic write cannot be split under any circumstances. In the case that
> an atomic write needs to be split, we reject the IO. If any atomic write
> needs to be split, it is most likely because of either:
> - atomic_write_unit_max_bytes reported is incorrect.
> - whoever submitted the atomic write BIO did not properly adhere to the
> request_queue limits.
>
> All atomic writes limits are by default set 0 to indicate no atomic write
> support. Even though it is assumed by Linux that a logical block can always
> be atomically written, we ignore this as it is not of particular interest.
> Stacked devices are just not supported either for now.
>
> Flag REQ_ATOMIC is used for indicating an atomic write.
>
> Helper function bdev_can_atomic_write() is added to indicate whether
> atomic writes may be issued to a bdev. It ensures that if the bdev is a
> partition, that the partition is properly aligned with
> atomic_write_unit_min_sectors and atomic_write_hw_boundary_sectors.

IMHO, the commit message can definitely use a re-write. I agree that you
have put in a lot of information, but I think it can be more organized.

>
> Contains significant contributions from:
> Himanshu Madhani <himanshu.madhani@xxxxxxxxxx>

Myabe it can use a better tag then.
"Documentation/process/submitting-patches.rst"

>
> Signed-off-by: John Garry <john.g.garry@xxxxxxxxxx>
> ---
> Documentation/ABI/stable/sysfs-block | 52 ++++++++++++++
> block/blk-merge.c | 91 ++++++++++++++++++++++-
> block/blk-settings.c | 103 +++++++++++++++++++++++++++
> block/blk-sysfs.c | 33 +++++++++
> block/blk.h | 3 +
> include/linux/blk_types.h | 2 +
> include/linux/blkdev.h | 60 ++++++++++++++++
> 7 files changed, 343 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index 1fe9a553c37b..4c775f4bdefe 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -21,6 +21,58 @@ Description:
> device is offset from the internal allocation unit's
> natural alignment.
>
> +What: /sys/block/<disk>/atomic_write_max_bytes
> +Date: February 2024
> +Contact: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx>
> +Description:
> + [RO] This parameter specifies the maximum atomic write
> + size reported by the device. This parameter is relevant
> + for merging of writes, where a merged atomic write
> + operation must not exceed this number of bytes.
> + This parameter may be greater to the value in
> + atomic_write_unit_max_bytes as
> + atomic_write_unit_max_bytes will be rounded down to a
> + power-of-two and atomic_write_unit_max_bytes may also be
> + limited by some other queue limits, such as max_segments.
> + This parameter - along with atomic_write_unit_min_bytes
> + and atomic_write_unit_max_bytes - will not be larger than
> + max_hw_sectors_kb, but may be larger than max_sectors_kb.
> +
> +
> +What: /sys/block/<disk>/atomic_write_unit_min_bytes
> +Date: February 2024
> +Contact: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx>
> +Description:
> + [RO] This parameter specifies the smallest block which can
> + be written atomically with an atomic write operation. All
> + atomic write operations must begin at a
> + atomic_write_unit_min boundary and must be multiples of
> + atomic_write_unit_min. This value must be a power-of-two.
> +
> +
> +What: /sys/block/<disk>/atomic_write_unit_max_bytes
> +Date: February 2024
> +Contact: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx>
> +Description:
> + [RO] This parameter defines the largest block which can be
> + written atomically with an atomic write operation. This
> + value must be a multiple of atomic_write_unit_min and must
> + be a power-of-two. This value will not be larger than
> + atomic_write_max_bytes.
> +
> +
> +What: /sys/block/<disk>/atomic_write_boundary_bytes
> +Date: February 2024
> +Contact: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx>
> +Description:
> + [RO] A device may need to internally split I/Os which
> + straddle a given logical block address boundary. In that
> + case a single atomic write operation will be processed as
> + one of more sub-operations which each complete atomically.
> + This parameter specifies the size in bytes of the atomic
> + boundary if one is reported by the device. This value must
> + be a power-of-two.
> +
>
> What: /sys/block/<disk>/diskseq
> Date: February 2021
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 74e9e775f13d..12a75a252ca2 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -18,6 +18,42 @@
> #include "blk-rq-qos.h"
> #include "blk-throttle.h"
>

/* A comment explaining this function and arguments could be helpful */

> +static bool rq_straddles_atomic_write_boundary(struct request *rq,
> + unsigned int front,
> + unsigned int back)

A better naming perhaps be start_adjust, end_adjust?

> +{
> + unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
> + unsigned int mask, imask;
> + loff_t start, end;

start_rq_pos, end_rq_pos maybe?

> +
> + if (!boundary)
> + return false;
> +
> + start = rq->__sector << SECTOR_SHIFT;

blk_rq_pos(rq) perhaps?

> + end = start + rq->__data_len;

blk_rq_bytes(rq) perhaps? It should be..
> +
> + start -= front;
> + end += back;
> +
> + /* We're longer than the boundary, so must be crossing it */
> + if (end - start > boundary)
> + return true;
> +
> + mask = boundary - 1;
> +
> + /* start/end are boundary-aligned, so cannot be crossing */
> + if (!(start & mask) || !(end & mask))
> + return false;
> +
> + imask = ~mask;
> +
> + /* Top bits are different, so crossed a boundary */
> + if ((start & imask) != (end & imask))
> + return true;

The last condition looks wrong. Shouldn't it be end - 1?

> +
> + return false;
> +}

Can we do something like this?

static bool rq_straddles_atomic_write_boundary(struct request *rq,
unsigned int start_adjust,
unsigned int end_adjust)
{
unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
unsigned long boundary_mask;
unsigned long start_rq_pos, end_rq_pos;

if (!boundary)
return false;

start_rq_pos = blk_rq_pos(rq) << SECTOR_SHIFT;
end_rq_pos = start_rq_pos + blk_rq_bytes(rq);

start_rq_pos -= start_adjust;
end_rq_pos += end_adjust;

boundary_mask = boundary - 1;

if ((start_rq_pos | boundary_mask) != (end_rq_pos | boundary_mask))
return true;

return false;
}

I was thinking this check should cover all cases? Thoughts?


> +
> static inline void bio_get_first_bvec(struct bio *bio, struct bio_vec *bv)
> {
> *bv = mp_bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> @@ -167,7 +203,16 @@ static inline unsigned get_max_io_size(struct bio *bio,
> {
> unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
> unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
> - unsigned max_sectors = lim->max_sectors, start, end;
> + unsigned max_sectors, start, end;
> +
> + /*
> + * We ignore lim->max_sectors for atomic writes simply because
> + * it may less than the bio size, which we cannot tolerate.
> + */
> + if (bio->bi_opf & REQ_ATOMIC)
> + max_sectors = lim->atomic_write_max_sectors;
> + else
> + max_sectors = lim->max_sectors;
>
> if (lim->chunk_sectors) {
> max_sectors = min(max_sectors,
> @@ -305,6 +350,11 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
> *segs = nsegs;
> return NULL;
> split:
> + if (bio->bi_opf & REQ_ATOMIC) {
> + bio->bi_status = BLK_STS_IOERR;
> + bio_endio(bio);
> + return ERR_PTR(-EINVAL);
> + }
> /*
> * We can't sanely support splitting for a REQ_NOWAIT bio. End it
> * with EAGAIN if splitting is required and return an error pointer.
> @@ -645,6 +695,13 @@ int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
> return 0;
> }
>
> + if (req->cmd_flags & REQ_ATOMIC) {
> + if (rq_straddles_atomic_write_boundary(req,
> + 0, bio->bi_iter.bi_size)) {
> + return 0;
> + }
> + }
> +
> return ll_new_hw_segment(req, bio, nr_segs);
> }
>
> @@ -664,6 +721,13 @@ static int ll_front_merge_fn(struct request *req, struct bio *bio,
> return 0;
> }
>
> + if (req->cmd_flags & REQ_ATOMIC) {
> + if (rq_straddles_atomic_write_boundary(req,
> + bio->bi_iter.bi_size, 0)) {
> + return 0;
> + }
> + }
> +
> return ll_new_hw_segment(req, bio, nr_segs);
> }
>
> @@ -700,6 +764,13 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
> blk_rq_get_max_sectors(req, blk_rq_pos(req)))
> return 0;
>
> + if (req->cmd_flags & REQ_ATOMIC) {
> + if (rq_straddles_atomic_write_boundary(req,
> + 0, blk_rq_bytes(next))) {
> + return 0;
> + }
> + }
> +
> total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
> if (total_phys_segments > blk_rq_get_max_segments(req))
> return 0;
> @@ -795,6 +866,18 @@ static enum elv_merge blk_try_req_merge(struct request *req,
> return ELEVATOR_NO_MERGE;
> }
>
> +static bool blk_atomic_write_mergeable_rq_bio(struct request *rq,
> + struct bio *bio)
> +{
> + return (rq->cmd_flags & REQ_ATOMIC) == (bio->bi_opf & REQ_ATOMIC);
> +}
> +
> +static bool blk_atomic_write_mergeable_rqs(struct request *rq,
> + struct request *next)
> +{
> + return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
> +}
> +
> /*
> * For non-mq, this has to be called with the request spinlock acquired.
> * For mq with scheduling, the appropriate queue wide lock should be held.
> @@ -814,6 +897,9 @@ static struct request *attempt_merge(struct request_queue *q,
> if (req->ioprio != next->ioprio)
> return NULL;
>
> + if (!blk_atomic_write_mergeable_rqs(req, next))
> + return NULL;
> +
> /*
> * If we are allowed to merge, then append bio list
> * from next to rq and release next. merge_requests_fn
> @@ -941,6 +1027,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
> if (rq->ioprio != bio_prio(bio))
> return false;
>
> + if (blk_atomic_write_mergeable_rq_bio(rq, bio) == false)
> + return false;
> +
> return true;
> }
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 06ea91e51b8b..176f26374abc 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -59,6 +59,13 @@ void blk_set_default_limits(struct queue_limits *lim)
> lim->zoned = false;
> lim->zone_write_granularity = 0;
> lim->dma_alignment = 511;
> + lim->atomic_write_hw_max_sectors = 0;
> + lim->atomic_write_max_sectors = 0;
> + lim->atomic_write_hw_boundary_sectors = 0;
> + lim->atomic_write_hw_unit_min_sectors = 0;
> + lim->atomic_write_unit_min_sectors = 0;
> + lim->atomic_write_hw_unit_max_sectors = 0;
> + lim->atomic_write_unit_max_sectors = 0;
> }
>
> /**
> @@ -101,6 +108,44 @@ void blk_queue_bounce_limit(struct request_queue *q, enum blk_bounce bounce)
> }
> EXPORT_SYMBOL(blk_queue_bounce_limit);
>
> +
> +/*
> + * Returns max guaranteed sectors which we can fit in a bio. For convenience of
> + * users, rounddown_pow_of_two() the return value.
> + *
> + * We always assume that we can fit in at least PAGE_SIZE in a segment, apart
> + * from first and last segments.
> + */
> +static unsigned int blk_queue_max_guaranteed_bio_sectors(
> + struct queue_limits *limits,
> + struct request_queue *q)
> +{
> + unsigned int max_segments = min(BIO_MAX_VECS, limits->max_segments);
> + unsigned int length;
> +
> + length = min(max_segments, 2) * queue_logical_block_size(q);
> + if (max_segments > 2)
> + length += (max_segments - 2) * PAGE_SIZE;
> +
> + return rounddown_pow_of_two(length >> SECTOR_SHIFT);
> +}
> +
> +static void blk_atomic_writes_update_limits(struct request_queue *q)
> +{
> + struct queue_limits *limits = &q->limits;
> + unsigned int max_hw_sectors =
> + rounddown_pow_of_two(limits->max_hw_sectors);
> + unsigned int unit_limit = min(max_hw_sectors,
> + blk_queue_max_guaranteed_bio_sectors(limits, q));
> +
> + limits->atomic_write_max_sectors =
> + min(limits->atomic_write_hw_max_sectors, max_hw_sectors);
> + limits->atomic_write_unit_min_sectors =
> + min(limits->atomic_write_hw_unit_min_sectors, unit_limit);
> + limits->atomic_write_unit_max_sectors =
> + min(limits->atomic_write_hw_unit_max_sectors, unit_limit);
> +}
> +
> /**
> * blk_queue_max_hw_sectors - set max sectors for a request for this queue
> * @q: the request queue for the device
> @@ -145,6 +190,8 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto
> limits->logical_block_size >> SECTOR_SHIFT);
> limits->max_sectors = max_sectors;
>
> + blk_atomic_writes_update_limits(q);
> +
> if (!q->disk)
> return;
> q->disk->bdi->io_pages = max_sectors >> (PAGE_SHIFT - 9);
> @@ -182,6 +229,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
> }
> EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>
> +/**
> + * blk_queue_atomic_write_max_bytes - set max bytes supported by
> + * the device for atomic write operations.
> + * @q: the request queue for the device
> + * @bytes: maximum bytes supported
> + */
> +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> + unsigned int bytes)
> +{
> + q->limits.atomic_write_hw_max_sectors = bytes >> SECTOR_SHIFT;
> + blk_atomic_writes_update_limits(q);
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
> +
> +/**
> + * blk_queue_atomic_write_boundary_bytes - Device's logical block address space
> + * which an atomic write should not cross.
> + * @q: the request queue for the device
> + * @bytes: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
> + unsigned int bytes)
> +{
> + q->limits.atomic_write_hw_boundary_sectors = bytes >> SECTOR_SHIFT;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
> +
> +/**
> + * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
> + * atomically to the device.
> + * @q: the request queue for the device
> + * @sectors: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
> + unsigned int sectors)
> +{
> +
> + q->limits.atomic_write_hw_unit_min_sectors = sectors;
> + blk_atomic_writes_update_limits(q);
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
> +
> +/*
> + * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
> + * atomically to the device.
> + * @q: the request queue for the device
> + * @sectors: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
> + unsigned int sectors)
> +{
> + q->limits.atomic_write_hw_unit_max_sectors = sectors;
> + blk_atomic_writes_update_limits(q);
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
> +
> /**
> * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
> * @q: the request queue for the device
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 6b2429cad81a..3978f14f9769 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -118,6 +118,30 @@ static ssize_t queue_max_discard_segments_show(struct request_queue *q,
> return queue_var_show(queue_max_discard_segments(q), page);
> }
>
> +static ssize_t queue_atomic_write_max_bytes_show(struct request_queue *q,
> + char *page)
> +{
> + return queue_var_show(queue_atomic_write_max_bytes(q), page);
> +}
> +
> +static ssize_t queue_atomic_write_boundary_show(struct request_queue *q,
> + char *page)
> +{
> + return queue_var_show(queue_atomic_write_boundary_bytes(q), page);
> +}
> +
> +static ssize_t queue_atomic_write_unit_min_show(struct request_queue *q,
> + char *page)
> +{
> + return queue_var_show(queue_atomic_write_unit_min_bytes(q), page);
> +}
> +
> +static ssize_t queue_atomic_write_unit_max_show(struct request_queue *q,
> + char *page)
> +{
> + return queue_var_show(queue_atomic_write_unit_max_bytes(q), page);
> +}
> +
> static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page)
> {
> return queue_var_show(q->limits.max_integrity_segments, page);
> @@ -502,6 +526,11 @@ QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes");
> QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes");
> QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
>
> +QUEUE_RO_ENTRY(queue_atomic_write_max_bytes, "atomic_write_max_bytes");
> +QUEUE_RO_ENTRY(queue_atomic_write_boundary, "atomic_write_boundary_bytes");
> +QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes");
> +QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
> +
> QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
> QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
> QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
> @@ -629,6 +658,10 @@ static struct attribute *queue_attrs[] = {
> &queue_discard_max_entry.attr,
> &queue_discard_max_hw_entry.attr,
> &queue_discard_zeroes_data_entry.attr,
> + &queue_atomic_write_max_bytes_entry.attr,
> + &queue_atomic_write_boundary_entry.attr,
> + &queue_atomic_write_unit_min_entry.attr,
> + &queue_atomic_write_unit_max_entry.attr,
> &queue_write_same_max_entry.attr,
> &queue_write_zeroes_max_entry.attr,
> &queue_zone_append_max_entry.attr,
> diff --git a/block/blk.h b/block/blk.h
> index 050696131329..6ba8333fcf26 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -178,6 +178,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
> if (unlikely(op == REQ_OP_WRITE_ZEROES))
> return q->limits.max_write_zeroes_sectors;
>
> + if (rq->cmd_flags & REQ_ATOMIC)
> + return q->limits.atomic_write_max_sectors;
> +
> return q->limits.max_sectors;
> }
>
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index f288c94374b3..cd7cceb8565d 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -422,6 +422,7 @@ enum req_flag_bits {
> __REQ_DRV, /* for driver use */
> __REQ_FS_PRIVATE, /* for file system (submitter) use */
>
> + __REQ_ATOMIC, /* for atomic write operations */
> /*
> * Command specific flags, keep last:
> */
> @@ -448,6 +449,7 @@ enum req_flag_bits {
> #define REQ_RAHEAD (__force blk_opf_t)(1ULL << __REQ_RAHEAD)
> #define REQ_BACKGROUND (__force blk_opf_t)(1ULL << __REQ_BACKGROUND)
> #define REQ_NOWAIT (__force blk_opf_t)(1ULL << __REQ_NOWAIT)
> +#define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC)

Let's add this in the same order as of __REQ_ATOMIC i.e. after
REQ_FS_PRIVATE macro

> #define REQ_POLLED (__force blk_opf_t)(1ULL << __REQ_POLLED)
> #define REQ_ALLOC_CACHE (__force blk_opf_t)(1ULL << __REQ_ALLOC_CACHE)
> #define REQ_SWAP (__force blk_opf_t)(1ULL << __REQ_SWAP)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 99e4f5e72213..40ed56ef4937 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -299,6 +299,14 @@ struct queue_limits {
> unsigned int discard_alignment;
> unsigned int zone_write_granularity;
>
> + unsigned int atomic_write_hw_max_sectors;
> + unsigned int atomic_write_max_sectors;
> + unsigned int atomic_write_hw_boundary_sectors;
> + unsigned int atomic_write_hw_unit_min_sectors;
> + unsigned int atomic_write_unit_min_sectors;
> + unsigned int atomic_write_hw_unit_max_sectors;
> + unsigned int atomic_write_unit_max_sectors;
> +

1 liner comment for above members please?

> unsigned short max_segments;
> unsigned short max_integrity_segments;
> unsigned short max_discard_segments;
> @@ -885,6 +893,14 @@ void blk_queue_zone_write_granularity(struct request_queue *q,
> unsigned int size);
> extern void blk_queue_alignment_offset(struct request_queue *q,
> unsigned int alignment);
> +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> + unsigned int bytes);
> +void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
> + unsigned int sectors);
> +void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
> + unsigned int sectors);
> +void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
> + unsigned int bytes);
> void disk_update_readahead(struct gendisk *disk);
> extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
> extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
> @@ -1291,6 +1307,30 @@ static inline int queue_dma_alignment(const struct request_queue *q)
> return q ? q->limits.dma_alignment : 511;
> }
>
> +static inline unsigned int
> +queue_atomic_write_unit_max_bytes(const struct request_queue *q)
> +{
> + return q->limits.atomic_write_unit_max_sectors << SECTOR_SHIFT;
> +}
> +
> +static inline unsigned int
> +queue_atomic_write_unit_min_bytes(const struct request_queue *q)
> +{
> + return q->limits.atomic_write_unit_min_sectors << SECTOR_SHIFT;
> +}
> +
> +static inline unsigned int
> +queue_atomic_write_boundary_bytes(const struct request_queue *q)
> +{
> + return q->limits.atomic_write_hw_boundary_sectors << SECTOR_SHIFT;
> +}
> +
> +static inline unsigned int
> +queue_atomic_write_max_bytes(const struct request_queue *q)
> +{
> + return q->limits.atomic_write_max_sectors << SECTOR_SHIFT;
> +}
> +
> static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
> {
> return queue_dma_alignment(bdev_get_queue(bdev));
> @@ -1540,6 +1580,26 @@ struct io_comp_batch {
> void (*complete)(struct io_comp_batch *);
> };
>
> +static inline bool bdev_can_atomic_write(struct block_device *bdev)
> +{
> + struct request_queue *bd_queue = bdev->bd_queue;
> + struct queue_limits *limits = &bd_queue->limits;
> +
> + if (!limits->atomic_write_unit_min_sectors)
> + return false;
> +
> + if (bdev_is_partition(bdev)) {
> + sector_t bd_start_sect = bdev->bd_start_sect;
> + unsigned int granularity = max(

atomic_align perhaps?

> + limits->atomic_write_unit_min_sectors,
> + limits->atomic_write_hw_boundary_sectors);
> + if (do_div(bd_start_sect, granularity))
> + return false;
> + }

since atomic_align is a power of 2. Why not use IS_ALIGNED()?
(bitwise operation instead of div)?

> +
> + return true;
> +}
> +
> #define DEFINE_IO_COMP_BATCH(name) struct io_comp_batch name = { }
>
> #endif /* _LINUX_BLKDEV_H */
> --
> 2.31.1

-ritesh