Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware

From: Brian Foster
Date: Fri May 31 2024 - 11:43:26 EST


On Fri, May 31, 2024 at 07:03:58AM -0700, Darrick J. Wong wrote:
> On Fri, May 31, 2024 at 06:11:25AM -0700, Christoph Hellwig wrote:
> > On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> > > XXX: how do we detect a iomap containing a cow mapping over a hole
> > > in iomap_zero_iter()? The XFS code implies this case also needs to
> > > zero the page cache if there is data present, so trigger for page
> > > cache lookup only in iomap_zero_iter() needs to handle this case as
> > > well.
> >
> > If there is no data in the page cache and either a whole or unwritten
> > extent it really should not matter what is in the COW fork, a there
> > obviously isn't any data we could zero.
> >
> > If there is data in the page cache for something that is marked as
> > a hole in the srcmap, but we have data in the COW fork due to
> > COW extsize preallocation we'd need to zero it, but as the
> > xfs iomap ops don't return a separate srcmap for that case we
> > should be fine. Or am I missing something?
>
> It might be useful to skip the scan for dirty pagecache if both forks
> have holes, since (in theory) that's never possible on xfs.
>
> OTOH maybe there are filesystems that allow dirty pagecache over a hole?
>

IIRC there was a case where dirty cache can exist over what is reported
as a hole to zero range. I want to say it was something like a COW
prealloc over a data fork hole followed by a buffered write and then a
zero range, but I don't recall the details. That is all something that
should be fixed on the lookup side anyways.

Brian

> > > + * Note: when zeroing unwritten extents, we might have data in the page cache
> > > + * over an unwritten extent. In this case, we want to do a pure lookup on the
> > > + * page cache and not create a new folio as we don't need to perform zeroing on
> > > + * unwritten extents if there is no cached data over the given range.
> > > */
> > > struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> > > {
> > > fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
> > >
> > > + if (iter->flags & IOMAP_ZERO) {
> > > + const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > +
> > > + if (srcmap->type == IOMAP_UNWRITTEN)
> > > + fgp &= ~FGP_CREAT;
> > > + }
> >
> > Nit: The comment would probably stand out a little better if it was
> > right next to the IOMAP_ZERO conditional instead of above the
> > function.
>
> Agreed.
>
> > > + if (status) {
> > > + if (status == -ENOENT) {
> > > + /*
> > > + * Unwritten extents need to have page cache
> > > + * lookups done to determine if they have data
> > > + * over them that needs zeroing. If there is no
> > > + * data, we'll get -ENOENT returned here, so we
> > > + * can just skip over this index.
> > > + */
> > > + WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> >
> > I'd return -EIO if the WARN_ON triggers.
> >
> > > +loop_continue:
> >
> > While I'm no strange to gotos for loop control something trips me
> > up about jumping to the end of the loop. Here is what I could come
> > up with instead. Not arguing it's objectively better, but I somehow
> > like it a little better:
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 700b22d6807783..81378f7cd8d7ff 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1412,49 +1412,56 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> > bool ret;
> >
> > status = iomap_write_begin(iter, pos, bytes, &folio);
> > - if (status) {
> > - if (status == -ENOENT) {
> > - /*
> > - * Unwritten extents need to have page cache
> > - * lookups done to determine if they have data
> > - * over them that needs zeroing. If there is no
> > - * data, we'll get -ENOENT returned here, so we
> > - * can just skip over this index.
> > - */
> > - WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> > - if (bytes > PAGE_SIZE - offset_in_page(pos))
> > - bytes = PAGE_SIZE - offset_in_page(pos);
> > - goto loop_continue;
> > - }
> > + if (status && status != -ENOENT)
> > return status;
> > - }
> > - if (iter->iomap.flags & IOMAP_F_STALE)
> > - break;
> >
> > - offset = offset_in_folio(folio, pos);
> > - if (bytes > folio_size(folio) - offset)
> > - bytes = folio_size(folio) - offset;
> > + if (status == -ENOENT) {
> > + /*
> > + * If we end up here, we did not find a folio in the
> > + * page cache for an unwritten extent and thus can
> > + * skip over the range.
> > + */
> > + if (WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN))
> > + return -EIO;
> >
> > - /*
> > - * If the folio over an unwritten extent is clean (i.e. because
> > - * it has been read from), then it already contains zeros. Hence
> > - * we can just skip it.
> > - */
> > - if (srcmap->type == IOMAP_UNWRITTEN &&
> > - !folio_test_dirty(folio)) {
> > - folio_unlock(folio);
> > - goto loop_continue;
> > + /*
> > + * XXX: It would be nice if we could get the offset of
> > + * the next entry in the pagecache so that we don't have
> > + * to iterate one page at a time here.
> > + */
> > + offset = offset_in_page(pos);
> > + if (bytes > PAGE_SIZE - offset)
> > + bytes = PAGE_SIZE - offset;
>
> Why is it PAGE_SIZE here and not folio_size() like below?
>
> (I know you're just copying the existing code; I'm merely wondering if
> this is some minor bug.)
>
> --D
>
> > + } else {
> > + if (iter->iomap.flags & IOMAP_F_STALE)
> > + break;
> > +
> > + offset = offset_in_folio(folio, pos);
> > + if (bytes > folio_size(folio) - offset)
> > + bytes = folio_size(folio) - offset;
> > +
> > + /*
> > + * If the folio over an unwritten extent is clean (i.e.
> > + * because it has only been read from), then it already
> > + * contains zeros. Hence we can just skip it.
> > + */
> > + if (srcmap->type == IOMAP_UNWRITTEN &&
> > + !folio_test_dirty(folio)) {
> > + folio_unlock(folio);
> > + status = -ENOENT;
> > + }
> > }
> >
> > - folio_zero_range(folio, offset, bytes);
> > - folio_mark_accessed(folio);
> > + if (status != -ENOENT) {
> > + folio_zero_range(folio, offset, bytes);
> > + folio_mark_accessed(folio);
> >
> > - ret = iomap_write_end(iter, pos, bytes, bytes, folio);
> > - __iomap_put_folio(iter, pos, bytes, folio);
> > - if (WARN_ON_ONCE(!ret))
> > - return -EIO;
> > + ret = iomap_write_end(iter, pos, bytes, bytes, folio);
> > + __iomap_put_folio(iter, pos, bytes, folio);
> > + if (WARN_ON_ONCE(!ret))
> > + return -EIO;
> > + }
> >
> > -loop_continue:
> > pos += bytes;
> > length -= bytes;
> > written += bytes;
> >
>