Re: [PATCH 0/3] zone-append support in aio and io-uring

From: Damien Le Moal
Date: Thu Jun 18 2020 - 23:08:39 EST


On 2020/06/19 2:55, Kanchan Joshi wrote:
> On Wed, Jun 17, 2020 at 11:56:34PM -0700, Christoph Hellwig wrote:
>> On Wed, Jun 17, 2020 at 10:53:36PM +0530, Kanchan Joshi wrote:
>>> This patchset enables issuing zone-append using aio and io-uring direct-io interface.
>>>
>>> For aio, this introduces opcode IOCB_CMD_ZONE_APPEND. Application uses start LBA
>>> of the zone to issue append. On completion 'res2' field is used to return
>>> zone-relative offset.
>>>
>>> For io-uring, this introduces three opcodes: IORING_OP_ZONE_APPEND/APPENDV/APPENDV_FIXED.
>>> Since io_uring does not have aio-like res2, cqe->flags are repurposed to return zone-relative offset
>>
>> And what exactly are the semantics supposed to be? Remember the
>> unix file abstractions does not know about zones at all.
>>
>> I really don't think squeezing low-level not quite block storage
>> protocol details into the Linux read/write path is a good idea.
>
> I was thinking of raw block-access to zone device rather than pristine file
> abstraction. And in that context, semantics, at this point, are unchanged
> (i.e. same as direct writes) while flexibility of async-interface gets
> added.

The aio->aio_offset use by the user and kernel differs for regular writes and
zone append writes. This is a significant enough change to say that semantic
changed. Yes both cases are direct IOs, but specification of the write location
by the user and where the data actually lands on disk are different.

There are a lot of subtle things that can happen that makes mapping of zone
append operations to POSIX semantic difficult. E.g. for a regular file, using
zone append for any write issued to a file open with O_APPEND maps well to POSIX
only for blocking writes. For asynchronous writes, that is not true anymore
since the order of data defined by the automatic append after the previous async
write breaks: data can land anywhere in the zone regardless of the offset
specified on submission.

> Synchronous-writes on single-zone sound fine, but synchronous-appends on
> single-zone do not sound that fine.

Why not ? This is a perfectly valid use case that actually does not have any
semantic problem. It indeed may not be the most effective method to get high
performance but saying that it is "not fine" is not correct in my opinion.

>
>> What could be a useful addition is a way for O_APPEND/RWF_APPEND writes
>> to report where they actually wrote, as that comes close to Zone Append
>> while still making sense at our usual abstraction level for file I/O.
>
> Thanks for suggesting this. O and RWF_APPEND may not go well with block
> access as end-of-file will be picked from dev inode. But perhaps a new
> flag like RWF_ZONE_APPEND can help to transform writes (aio or uring)
> into append without introducing new opcodes.

Yes, RWF_ZONE_APPEND may be better if the semantic of RWF_APPEND cannot be
cleanly reused. But as Christoph said, RWF_ZONE_APPEND semantic need to be
clarified so that all reviewer can check the code against the intended behavior,
and comment on that intended behavior too.

> And, I think, this can fit fine on file-abstraction of ZoneFS as well.

May be. Depends on what semantic you are after for user zone append interface.
Ideally, we should have at least the same for raw block device and zonefs. But
zonefs may be able to do a better job thanks to its real regular file
abstraction of zones. As Christoph said, we started looking into it but lacked
time to complete this work. This is still on-going.

--
Damien Le Moal
Western Digital Research