Re: [RFC 1/3] iomap: Support buffered RWF_WRITETHROUGH via async dio backend

From: Ojaswin Mujoo

Date: Fri Mar 13 2026 - 03:44:38 EST


On Wed, Mar 11, 2026 at 11:05:05PM +1100, Dave Chinner wrote:
> On Wed, Mar 11, 2026 at 04:05:29PM +0530, Ojaswin Mujoo wrote:
> > On Tue, Mar 10, 2026 at 05:48:12PM +1100, Dave Chinner wrote:
> > > On Mon, Mar 09, 2026 at 11:04:31PM +0530, Ojaswin Mujoo wrote:
> > > This is not what I envisaged write-through using DIO to look like.
> > > This is a DIO per folio, rather than a DIO per write() syscall. We
> > > want the latter to be the common case, not the former, especially
> > > when it comes to RWF_ATOMIC support.
> > >
> > > i.e. I was expecting something more like having a wt context
> > > allocated up front with an appropriately sized bvec appended to it
> > > (i.e. single allocation for the common case). Then in
> > > iomap_write_end(), we'd mark the folio as under writeback and add it
> > > to the bvec. Then we iterate through the IO range adding folio after
> > > folio to the bvec.
> > >
> > > When the bvec is full or we reach the end of the IO, we then push
> > > that bvec down to the DIO code. Ideally we'd also push the iomap we
> > > already hold down as well, so that the DIO code does not need to
> > > look it up again (unless the mapping is stale). The DIO completion
> > > callback then runs a completion callback that iterates the folios
> > > attached to the bvec and runs buffered writeback completion on them.
> > > It can then decrement the wt-ctx IO-in-flight counter.
> > >
> > > If there is more user data to submit, we keep going around (with a
> > > new bvec if we need it) adding folios and submitting them to the dio
> > > code until there is no more data to copy in and submit.
> > >
> > > The writethrough context then drops its own "in-flight" reference
> > > and waits for the in-flight counter to go to zero.
> >
> > Hi Dave,
> >
> > Thanks for the review. IIUC you are suggesting a per-iomap submission
> > of dio rather than a per-folio one,
>
> Yes, this is the original architectural premise of iomap: we map the
> extent first, then iterate over folios, then submit a single bio for
> the extent...
>
> > and for each iomap we submit we can
> > maintain a per-writethrough counter that lets us perform any
> > endio cleanup work. I can give this design a try in v2.
>
> Yes, this is exactly how iomap DIO completion tracking works for
> IO that requires multiple bios to be submitted. i.e. completion
> processing only runs once all IOs -and submission- have completed.
>
> > > > index c24d94349ca5..f4d8ff08a83a 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -713,7 +713,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > > > dio->i_size = i_size_read(inode);

<...>

> > currently.
>
> For a write() style syscall, yes. For AIO/io_uring, no.
>
> io_submit() only returns an error if there is something wrong
> with the aio ctx or iocbs being submitted. It does not report
> completion status of the iocbs that are submitted. You need to call
> io_getevents() to obtain the completion status of individual iocbs
> that have been submitted via io_submit().
>
> Think about it: if you submit 16 IOs in one io_submit() call and
> one fails, how do you find out which IO failed?
>
> > However, with our idea of making the DSYNC buffered aio also
> > truly async, via writethrough, won't we be violating this guarantee?
>
> No, the error will be returned to the AIO completion ring, same as
> it is now.

Thanks for the pointers Dave, I now have a decent picture of what
O_DSYNC/RWF_DSYNC IO will look like with writethrough. I'll try to
incorporate this in the next version, along with your other suggestions.

Regards,
ojaswin

>
> -Dave.
> --
> Dave Chinner
> dgc@xxxxxxxxxx