Re: [PATCH v4 6/6] io_uring: add support for zone-append

From: Damien Le Moal
Date: Fri Aug 14 2020 - 04:27:20 EST


On 2020/08/14 17:14, hch@xxxxxxxxxxxxx wrote:
> On Wed, Aug 05, 2020 at 07:35:28AM +0000, Damien Le Moal wrote:
>>> the write pointer. The only interesting addition is that we also want
>>> to report where we wrote. So I'd rather have RWF_REPORT_OFFSET or so.
>>
>> That works for me. But that rules out having the same interface for raw block
>> devices since O_APPEND has no meaning in that case. So for raw block devices, it
>> will have to be through zonefs. That works for me, and I think it was your idea
>> all along. Can you confirm please ?
>
> Yes. I don't think raw syscall level access to the zone append
> primitive makes sense. Either use zonefs for a file-like API, or
> use the NVMe pass through interface for 100% raw access.
>
>>> - take the exclusive per-inode (zone) lock and just issue either normal
>>> writes or zone append at your choice, relying on the lock to
>>> serialize other writers. For the async case this means we need a
>>> lock that can be released in a different context than it was acquired,
>>> which is a little ugly but can be done.
>>
>> Yes, that would be possible. But likely, this will also need calls to
>> inode_dio_wait() to avoid ending up with a mix of regular write and zone append
>> writes in flight (which likely would result in the regular write failing as the
>> zone append writes would go straight to the device without waiting for the zone
>> write lock like regular writes do).
>
> inode_dio_wait is a really bad implementation of almost a lock. I've
> started some work that I need to finish to just replace it with proper
> non-owner rwsems (or even the range locks Dave has been looking into).

OK.

>> This all sounds sensible to me. One last point though, specific to zonefs: if the
>> user opens a zone file with O_APPEND, I do not want to have that necessarily mean
>> "use zone append". And same for the "RWF_REPORT_OFFSET". The point here is that
>> both O_APPEND and RWF_REPORT_OFFSET can be used with both regular writes and
>> zone append writes, but neither of them clearly specifies whether the
>> application/user tolerates writing data to disk in a different order than the
>> issuing order... So another flag to indicate "atomic out-of-order writes" (==
>> zone append) ?
>
> O_APPEND pretty much implies out of order, as there is no way for an
> application to know which thread wins the race to write the next chunk.

Yes and no. If the application threads do not synchronize their calls to
io_submit(), then yes indeed, things can get out of order. But if the
application threads are synchronized, then the offset set for each append AIO
will follow the order of submission, so the user will not see its writes
completing at write offsets different from these implied offsets.

If O_APPEND is the sole flag that triggers the use of zone append, then we lose
this current implied and predictable positioning of the writes. Even for a
single thread by the way.
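
To make the difference concrete, here is a toy user-space model of a single
sequential zone. This is only a sketch: it is not kernel or zonefs code, and
the names zone_model, zone_regular_write() and zone_append() are made up for
illustration. A regular write carries an offset that must match the write
pointer, so the position is known at submission time. A zone append carries no
offset and the position is only known from what gets reported back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one sequential zone: only the write pointer is tracked. */
struct zone_model {
	uint64_t wp;        /* write pointer, in bytes from the zone start */
	uint64_t capacity;  /* zone capacity, in bytes */
};

/*
 * Regular write: the caller chooses the offset and it must equal the
 * write pointer. Returns false otherwise, modelling the write failure
 * seen when regular writes reach the zone out of order.
 */
bool zone_regular_write(struct zone_model *z, uint64_t off, uint64_t len)
{
	assert(z != NULL);
	if (off != z->wp || len > z->capacity - z->wp)
		return false;
	z->wp += len;
	return true;
}

/*
 * Zone append: the caller supplies no offset. The data lands at the
 * current write pointer and that position is reported in *written_at.
 */
bool zone_append(struct zone_model *z, uint64_t len, uint64_t *written_at)
{
	assert(z != NULL && written_at != NULL);
	if (len > z->capacity - z->wp)
		return false;
	*written_at = z->wp;
	z->wp += len;
	return true;
}
```

With zone_regular_write(), a submitter that issues its writes in order knows
each offset in advance. With zone_append(), two appends that get reordered
both succeed, but each one learns its offset only on completion.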

Hence my thinking to preserve this, meaning that O_APPEND alone will see zonefs
using regular writes. O_APPEND or RWF_APPEND + RWF_SOME_NICELY_NAMED_FLAG_for_ZA
would trigger the use of zone append, with the implied effect that writes may
not end up in the same order as they are submitted. So the flag name could be:
RWF_RELAXED_ORDER or something like that (I am bad at naming...).

And we also have RWF_REPORT_OFFSET which applies to all cases of append writes,
regardless of the command used.
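
As a rough sketch of the flag semantics proposed above, the selection logic
could look like the following. Everything here is hypothetical: the DEMO_*
bits are placeholders for this example only (they are not the real O_APPEND or
RWF_* values), RWF_RELAXED_ORDER and RWF_REPORT_OFFSET are only proposed names
that do not exist, and demo_pick_cmd() / demo_report_offset() are not zonefs
functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag bits for this sketch only; not real kernel values. */
#define DEMO_O_APPEND          (1u << 0) /* file opened with O_APPEND */
#define DEMO_RWF_APPEND        (1u << 1) /* per-I/O RWF_APPEND */
#define DEMO_RWF_RELAXED_ORDER (1u << 2) /* proposed name, does not exist */
#define DEMO_RWF_REPORT_OFFSET (1u << 3) /* proposed name, does not exist */
#define DEMO_ALL_FLAGS         ((1u << 4) - 1)

enum demo_write_cmd { DEMO_REGULAR_WRITE, DEMO_ZONE_APPEND };

/* True if the write is an append, by open flag or by per-I/O flag. */
static bool demo_is_append(unsigned int flags)
{
	return (flags & (DEMO_O_APPEND | DEMO_RWF_APPEND)) != 0;
}

/*
 * Append alone keeps regular writes and their predictable positioning.
 * Zone append is used only when the caller also accepts relaxed ordering.
 */
enum demo_write_cmd demo_pick_cmd(unsigned int flags)
{
	assert((flags & ~DEMO_ALL_FLAGS) == 0);
	if (demo_is_append(flags) && (flags & DEMO_RWF_RELAXED_ORDER))
		return DEMO_ZONE_APPEND;
	return DEMO_REGULAR_WRITE;
}

/* Offset reporting applies to any append write, whatever command is used. */
bool demo_report_offset(unsigned int flags)
{
	assert((flags & ~DEMO_ALL_FLAGS) == 0);
	return demo_is_append(flags) && (flags & DEMO_RWF_REPORT_OFFSET) != 0;
}
```

The point of keeping the two proposed flags separate is that reporting the
written offset and tolerating reordering are independent choices for the
application.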

Does this make sense ?


--
Damien Le Moal
Western Digital Research