Re: [PATCH v4 6/6] io_uring: add support for zone-append

From: Damien Le Moal
Date: Wed Aug 05 2020 - 03:35:36 EST

On 2020/07/31 21:51, hch@xxxxxxxxxxxxx wrote:
> On Fri, Jul 31, 2020 at 10:16:49AM +0000, Damien Le Moal wrote:
>>> Let's keep semantics and implementation separate. For the case
>>> where we report the actual offset we need a size limitation and no
>>> short writes.
>> OK. So the name of the flag confused me. The flag name should reflect "Do zone
>> append and report written offset", right ?
> Well, we already have O_APPEND, which is the equivalent to append to
> the write pointer. The only interesting addition is that we also want
> to report where we wrote. So I'd rather have RWF_REPORT_OFFSET or so.

That works for me. But that rules out having the same interface for raw block
devices since O_APPEND has no meaning in that case. So for raw block devices, it
will have to be through zonefs. That works for me, and I think it was your idea
all along. Can you confirm please ?

>> But I think I am starting to see the picture you are drawing here:
>> 1) Introduce a fcntl() to get "maximum size for atomic append writes"
>> 2) Introduce an aio flag specifying "Do atomic append write and report written
>> offset"
> I think we just need the 'report written offset flag', in fact even for
> zonefs with the right locking we can handle unlimited write sizes, just
> with lower performance. E.g.
> 1) check if the write size is larger than the zone append limit
> if no:
> - take the shared lock and issue a zone append, done
> if yes:
> - take the exclusive per-inode (zone) lock and just issue either normal
> writes or zone append at your choice, relying on the lock to
> serialize other writers. For the async case this means we need a
> lock that can be released in a different context than it was acquired,
> which is a little ugly but can be done.

Yes, that would be possible. But likely, this will also need calls to
inode_dio_wait() to avoid ending up with a mix of regular write and zone append
writes in flight (which likely would result in the regular write failing as the
zone append writes would go straight to the device without waiting for the zone
write lock like regular writes do).
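
To make the size check above concrete, here is a minimal userspace sketch of
the decision only. It is not kernel code: plan_zone_write(), struct write_plan
and the enum are hypothetical names, and the choice of regular writes for the
oversized case is just one of the two options mentioned. The
drain_inflight_dio field stands for the point where inode_dio_wait() would
likely be needed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout: a userspace model, not kernel code. */
enum zone_lock_mode { ZONE_LOCK_SHARED, ZONE_LOCK_EXCLUSIVE };

struct write_plan {
	enum zone_lock_mode lock;	/* per-inode (zone) lock mode to take */
	bool use_zone_append;		/* true: zone append, false: regular write */
	bool drain_inflight_dio;	/* true: wait for in-flight direct I/O first */
};

static struct write_plan plan_zone_write(size_t len, size_t max_append_bytes)
{
	struct write_plan plan;

	if (len <= max_append_bytes) {
		/* Fits in one zone append: appenders can share the lock. */
		plan.lock = ZONE_LOCK_SHARED;
		plan.use_zone_append = true;
		plan.drain_inflight_dio = false;
	} else {
		/*
		 * Larger than the limit: serialize against other writers.
		 * This sketch assumes regular writes are used here, so
		 * in-flight zone appends must be drained first to avoid
		 * mixing the two kinds of writes.
		 */
		plan.lock = ZONE_LOCK_EXCLUSIVE;
		plan.use_zone_append = false;
		plan.drain_inflight_dio = true;
	}
	return plan;
}
```

A write of exactly the limit still takes the shared path, since the check is
"larger than the zone append limit".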

This all sounds sensible to me. One last point though, specific to zonefs: if
the user opens a zone file with O_APPEND, I do not want that to necessarily
mean "use zone append". And the same goes for RWF_REPORT_OFFSET. The point here
is that both O_APPEND and RWF_REPORT_OFFSET can be used with both regular
writes and zone append writes, but neither of them clearly specifies whether
the application/user tolerates writing data to disk in a different order than
the issuing order... So another flag to indicate "atomic out-of-order writes"
(== zone append) ?

Without a new flag, we can only have these cases to enable zone append:

1) No O_APPEND: ignore RWF_REPORT_OFFSET and use regular writes using ->aio_ofst

2) O_APPEND without RWF_REPORT_OFFSET: use regular write with file size as
->aio_ofst (as done today already), do not report the write offset on completion.

3) O_APPEND with RWF_REPORT_OFFSET: use zone append, implying potentially out of
order writes.

And case (3) is not nice. I would prefer something like:

3) O_APPEND with RWF_REPORT_OFFSET: use regular write with file size as
->aio_ofst (as done today already), report the write offset on completion.

4) O_APPEND with RWF_ATOMIC_APPEND: use zone append writes, do not report the
write offset on completion.

5) O_APPEND with RWF_ATOMIC_APPEND+RWF_REPORT_OFFSET: use zone append writes,
report the write offset on completion.

RWF_ATOMIC_APPEND could also always imply RWF_REPORT_OFFSET. Any aio with
RWF_ATOMIC_APPEND that is larger than the zone append size limit would be
failed.
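
The preferred cases (1), (2) and the revised (3) to (5) can be written down as
a small decision function. This is only a sketch under assumptions: neither
RWF_REPORT_OFFSET nor RWF_ATOMIC_APPEND exists in the kernel headers, so the
SKETCH_ constants below carry made-up values, and decide_write() and its types
are hypothetical names. It also assumes that RWF_ATOMIC_APPEND is ignored
without O_APPEND, in the same way as RWF_REPORT_OFFSET in case (1), which the
list above does not spell out.

```c
#include <fcntl.h>	/* O_APPEND */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical values: these flags are proposals, not existing uapi. */
enum {
	SKETCH_RWF_REPORT_OFFSET = 1 << 0,
	SKETCH_RWF_ATOMIC_APPEND = 1 << 1,
};

enum write_kind {
	WRITE_REGULAR_AT_OFFSET,	/* regular write at the caller's offset */
	WRITE_REGULAR_AT_EOF,		/* regular write at the file size */
	WRITE_ZONE_APPEND,		/* zone append, possibly out of order */
};

struct write_decision {
	enum write_kind kind;
	bool report_offset;	/* report the written offset on completion */
	bool rejected;		/* atomic append larger than the limit */
};

static struct write_decision decide_write(int open_flags, int rw_flags,
					  size_t len, size_t max_append_bytes)
{
	struct write_decision d = { WRITE_REGULAR_AT_OFFSET, false, false };

	/* Case 1: no O_APPEND, both flags are ignored (assumption for ATOMIC). */
	if (!(open_flags & O_APPEND))
		return d;

	d.report_offset = (rw_flags & SKETCH_RWF_REPORT_OFFSET) != 0;

	if (rw_flags & SKETCH_RWF_ATOMIC_APPEND) {
		/* Cases 4 and 5: zone append, failed if it is too large. */
		d.kind = WRITE_ZONE_APPEND;
		d.rejected = len > max_append_bytes;
	} else {
		/* Cases 2 and 3: regular write at the current file size. */
		d.kind = WRITE_REGULAR_AT_EOF;
	}
	return d;
}
```

If RWF_ATOMIC_APPEND were to always imply RWF_REPORT_OFFSET, cases (4) and (5)
would collapse into one and report_offset would simply be set in the zone
append branch.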

Thoughts ?

Damien Le Moal
Western Digital Research