Re: [PATCH 3/3] io_uring: add support for zone-append

From: Jens Axboe
Date: Fri Jun 19 2020 - 10:15:10 EST


On 6/19/20 3:41 AM, javier.gonz@xxxxxxxxxxx wrote:
> Jens,
>
> Would you have time to answer a question below in this thread?
>
> On 18.06.2020 11:11, javier.gonz@xxxxxxxxxxx wrote:
>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>> On 2020/06/18 17:35, javier.gonz@xxxxxxxxxxx wrote:
>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>> From: Selvakumar S <selvakuma.s1@xxxxxxxxxxx>
>>>>>>
>>>>>> Introduce three new opcodes for zone-append -
>>>>>>
>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>
>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>
>>>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@xxxxxxxxxxx>
>>>>>> Signed-off-by: Kanchan Joshi <joshi.k@xxxxxxxxxxx>
>>>>>> Signed-off-by: Nitesh Shetty <nj.shetty@xxxxxxxxxxx>
>>>>>> Signed-off-by: Javier Gonzalez <javier.gonz@xxxxxxxxxxx>
>>>>>> ---
>>>>>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>> index 155f3d8..c14c873 100644
>>>>>> --- a/fs/io_uring.c
>>>>>> +++ b/fs/io_uring.c
>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>> unsigned long fsize;
>>>>>> u64 user_data;
>>>>>> u32 result;
>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>> + u32 append_offset;
>>>>>
>>>>> this can overflow. u64 is needed.
>>>>
>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>> only has space for u32 when we reuse the flags.
>>>>
>>>> We can of course create a new cqe structure, but that will come with
>>>> larger changes to io_uring for supporting append.
>>>>
>>>> Do you believe this is a better approach?
>>>
>>> The problem is that zone size are 32 bits in the kernel, as a number
>>> of sectors. So any device that has a zone size smaller or equal to
>>> 2^31 512B sectors can be accepted. Using a zone relative offset in
>>> bytes for returning zone append result is OK-ish, but to match the
>>> kernel supported range of possible zone size, you need 31+9 bits...
>>> 32 does not cut it.
>>
>> Agree. Our initial assumption was that u32 would cover current zone size
>> requirements, but if this is a no-go, we will take the longer path.
>
> Converting to u64 will require a new version of io_uring_cqe, where we
> extend at least 32 bits. I believe this will need a whole new allocation
> and probably ioctl().
>
> Is this an acceptable change for you? We will of course add support for
> liburing when we agree on the right way to do this.

If you need 64-bit of return value, then it's not going to work. Even
with the existing patches, reusing cqe->flags isn't going to fly, as
it would conflict with eg doing zone append writes with automatic
buffer selection.

We're not changing the io_uring_cqe. It's important to keep it lean, and
any other request type is generally fine with 64-bit tag + 32-bit result
(and 32-bit flags on the side) for completions.

Only viable alternative I see would be to provide an area to store this
information, and pass in a pointer to this at submission time through
the sqe. One issue I do see with that is if we only have this
information available at completion time, then we'd need some async punt
to copy it to user space... Generally not ideal.

A hackier approach would be to play some tricks with cqe->res and
cqe->flags, setting aside a flag to denote an extension of cqe->res.
That would mean excluding zone append (etc) from using buffer selection,
which probably isn't a huge deal. It'd be more problematic for any other
future flags. But if you just need 40 bits, then it could certainly
work. Rigth now, if cqe->flags & 1 is set, then (cqe->flags >> 16) is
the buffer ID. You could define IORING_CQE_F_ZONE_FOO to be bit 1, so
that:

uint64_t val = cqe->res; // assuming non-error here

if (cqe->flags & IORING_CQE_F_ZONE_FOO)
val |= (cqe->flags >> 16) << 32ULL;

and hence use the upper 16 bits of cqe->flags for the upper bits of your
(then) 48-bit total value.

--
Jens Axboe