Re: [RFC PATCH v2 2/5] iomap: Add initial support for buffered RWF_WRITETHROUGH
From: Ojaswin Mujoo
Date: Thu Apr 23 2026 - 06:09:43 EST
On Wed, Apr 22, 2026 at 12:00:34PM +0200, Jan Kara wrote:
> On Tue 21-04-26 23:37:01, Ojaswin Mujoo wrote:
> > On Mon, Apr 20, 2026 at 01:28:18PM +0200, Jan Kara wrote:
> > > On Sat 18-04-26 01:12:22, Ojaswin Mujoo wrote:
> > > > On Thu, Apr 16, 2026 at 02:34:15PM +0200, Jan Kara wrote:
> > > > > > @@ -1096,6 +1097,276 @@ static bool iomap_write_end(struct iomap_iter *iter, size_t len, size_t copied,
> > > > > > +static int iomap_writethrough_iter(struct iomap_writethrough_ctx *wt_ctx,
> > > > > > + struct iomap_iter *iter, struct iov_iter *i,
> > > > > > + const struct iomap_writethrough_ops *wt_ops)
> > > > > > +
> > > > > > +{
> >
> > <...>
> >
> > > > your comment but) after this email, I started digging a bit more into
> > > > why it is needed. As per my understanding, it tackles 2 things:
> > > >
> > > > Problem 1. It mkclean()s the old EOF folio so that the FS can take a
> > > > write fault again. This allows us to allocate new blocks which
> > > > previously might not have been allocated if bs < ps.
> > > >
> > > > Problem 2. Since mmap writes can dirty data beyond EOF, we zero the
> > > > range from the old EOF to the end of that folio so that readers don't
> > > > read junk data after the isize extension.
> > >
> > > Correct.
> > >
> > > > Another thing I noticed is that most users of
> > > > iomap_file_buffered_write() do their own EOF zeroing in the FS layer
> > > > (e.g. xfs_file_write_zero_eof(), ext4's new changes,
> > > > ntfs_extend_initialized_size(), etc.).
> > > > I think this FS-level zeroing should take care of mkclean()ing the EOF
> > > > folio (problem 1), as they call iomap_zero_range() which would flush
> > > > the EOF range anyway. So am I right in assuming that for FSes that do
> > > > their own zeroing, problem 1 is already taken care of?
> > >
> > > Well, I don't see anything that would write-protect the old tail page in
> > > iomap_zero_range(). I think the iomap_zero_range() calls are there mostly
> > > to address 2. Not only due to mmap but also possibly to clear whatever
> > > junk there can be in the blocks after EOF.
> >
> > Well, I was thinking more along the lines of: if the EOF page was mmap'd,
> > it would be dirty and blocks beyond EOF would be unmapped, so
> > iomap_zero_range() will write it back, which would mkclean() the folio.
> >
> > But I think the same race we discussed for problem 2 can also occur
> > here.
> >
> > Thread 1 (extending write)          Thread 2 (mmap writer)
> >
> > iomap_zero_range()
> >   filemap_write_and_wait_range()
> >                                     // mmaps & writes EOF range
> > iomap_write_iter()
> > isize = new_size
> > // pagecache_isize_extended() is needed
> > // to mkclean() the old EOF page.
>
> Yes, this race exists, and unlike in the case of zeroing, where it is
> mostly harmless, not guaranteeing that page_mkwrite() is called with the
> updated i_size can lead to the filesystem tripping on assertions, data
> loss, or similar.
Right.
>
> > > > As for 2, I think even after the FS's EOF zeroing, there might be a
> > > > window before iomap_write_iter() where an mmap writer can still dirty
> > > > EOF blocks, hence pagecache_isize_extended() would be needed here.
> > > > But doesn't that then make the EOF zeroing in the FS layer redundant?
> > > > Am I missing something here?
> > >
> > > Hmm, I agree the zeroing looks duplicative (for some users of
> > > pagecache_isize_extended()). And yes, doing the zeroing from
> > > xfs_file_write_zero_eof() is somewhat racy (an mmap writer can still
> > > come and write non-zeros before we update i_size), but I'd have a hard
> > > time arguing it really matters in practice - you are racing mmap writes
> > > with buffered writes, so no kind of write atomicity guarantee applies
> > > anyway.
> >
> > Yeah, it seems like the FS-level zeroing alone is not enough to take care
> > of either 1 or 2, and pagecache_isize_extended() should maybe be enough.
> > I was just wondering if we could optimize the zeroing away even for the
> > normal extend path (no racing mmap), avoiding the expensive
> > folio_zero_range() calls.
> >
> > Regardless, I've not looked at this more closely and it's a separate
> > issue, so we can revisit it later. For now I wanted some clarity around
> > pagecache_isize_extended(), so thanks for that.
>
> Well, but pagecache_isize_extended() doesn't guarantee that on-disk blocks
> are zeroed out, since it doesn't dirty the page. Also,
> xfs_file_write_zero_eof() potentially handles zeroing of more than the
> tail page. So you cannot simply drop one of these.
Hmm yeah, true: xfs_file_write_zero_eof() does take care of zeroing any
mapped EOF blocks on disk as well.
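
To summarize my understanding, the two mechanisms cover different halves
of the problem. A rough pseudocode sketch of an extending buffered write
(only iomap_zero_range(), i_size_write() and pagecache_isize_extended()
are real names here; the rest is just illustration):

	extending_write(inode, pos, len)
	{
		/* FS-level zeroing (e.g. xfs_file_write_zero_eof()):
		 * zeroes post-EOF bytes in the pagecache and any mapped
		 * on-disk blocks, possibly beyond the tail page. */
		fs_zero_eof();		/* ends up in iomap_zero_range() */

		/* ...window where an mmap writer can still dirty the
		 * old-EOF folio, before i_size is updated. */

		write_data();
		i_size_write(inode, new_size);

		/* mkclean() the old tail folio so the next mmap write
		 * faults again and sees the new i_size; does NOT zero
		 * on-disk blocks and does not dirty the page. */
		pagecache_isize_extended(inode, old_size, new_size);
	}

So neither one subsumes the other, as you said.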
>
> > > > Regardless, for our case I think we will also need to call
> > > > pagecache_isize_extended(), mainly to take care of problem 2, but
> > > > where exactly should we do it now? We currently change the isize in
> > > > endio(), but for aio that can run outside the inode or folio lock. I
> > > > think this function needs to be called under inode_lock(). Hmm, it's a
> > > > bit late here so I'll revisit this tomorrow with a fresh mind :)
> > >
> > > I think mainly to take care of problem 1... You are correct about
> > > inode_lock, but since we are updating i_size, we had better be holding
> > > it, shouldn't we?
> >
> > Yes, you are correct. In the aio writethrough codepath, the isize update
> > is happening without the inode lock, which is wrong. I overlooked the
> > fact that even aio dio uses IOMAP_DIO_FORCE_WAIT to force the isize
> > update under the inode lock, and we should do something similar as well.
>
> Yes.
>
> > So in v3, I'll make the change that extending writes shall always finish
> > in a "sync" fashion, so that ->endio() runs under the inode lock. Then,
> > after ->endio() in iomap_dio_complete(), I will call
(I meant iomap_writethrough_complete() here)
> > pagecache_isize_extended() to take care of this. Just like the isize
> > update right now, the isize extension only runs when the IO was
> > successful; otherwise we return an error to the user. This gives us
> > dio-like semantics while handling extension properly.
> >
> > Does that sound okay?
>
> Yep, sounds fine.
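To spell out the v3 plan in pseudocode (pagecache_isize_extended() and
IOMAP_DIO_FORCE_WAIT are real names; the rest is just my sketch of what
the patch will do, not actual code):

	iomap_writethrough_complete()
	{
		/* Extending writes always complete synchronously (as
		 * IOMAP_DIO_FORCE_WAIT does for dio), so the inode
		 * lock is held here. */
		ret = wt_ops->endio();	/* updates i_size on success */
		if (!ret && new_size > old_size)
			/* mkclean() the old tail folio so future mmap
			 * faults see the new i_size. */
			pagecache_isize_extended(inode, old_size,
						 new_size);
		return ret;
	}
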
Got it, thanks for the review, Jan!
Regards,
ojaswin
>
> Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR