Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read

From: Joanne Koong

Date: Thu Feb 12 2026 - 19:54:15 EST

On Thu, Feb 12, 2026 at 11:31 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Wed, Feb 11, 2026 at 03:13:48PM -0800, Joanne Koong wrote:
> > On Wed, Feb 11, 2026 at 1:03 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Feb 11, 2026 at 11:33:05AM -0800, Joanne Koong wrote:
> > > > ifs->read_bytes_pending gets initialized to the folio size, but if the
> > > > file being read in is smaller than the size of the folio, then we
> > > > reach this scenario because the file has been read in but
> > > > ifs->read_bytes_pending is still a positive value because it
> > > > represents the bytes between the end of the file and the end of the
> > > > folio. If the folio size is 16k and the file size is 4k:
> > > > a) ifs->read_bytes_pending gets initialized to 16k
> > > > b) ->read_folio_range() is called for the 4k read
> > > > c) the 4k read succeeds, ifs->read_bytes_pending is now 12k and the
> > > > 0 to 4k range is marked uptodate
> > > > d) the post-eof blocks are zeroed and marked uptodate in the call to
> > > > iomap_set_range_uptodate()
> > >
> > > This is the bug then. If they're marked uptodate, read_bytes_pending
> > > should be decremented at the same time. Now, I appreciate that
> > > iomap_set_range_uptodate() is called both from iomap_read_folio_iter()
> > > and __iomap_write_begin(), and it can't decrement read_bytes_pending
> > > in the latter case. Perhaps a flag or a second length parameter is
> > > the solution?
> >
> > I don't think it's enough to decrement read_bytes_pending by the
> > zeroed/read-inline length because there's these two edge cases:
> > a) some blocks in the folio were already uptodate from the very
> > beginning and skipped for IO but not decremented yet from
> > ifs->read_bytes_pending, which means in iomap_read_end(),
> > ifs->read_bytes_pending would be > 0 and the uptodate flag could get
> > XORed again. This means we need to also decrement read_bytes_pending
> > by bytes_submitted as well for this case
>
> Hm, that's a good one. It can't happen for readahead, but it can happen
> if we start out by writing to some blocks of a folio, then call
> read_folio to get the remaining blocks uptodate. We could avoid it
> happening by initialising read_bytes_pending to folio_size() -
> bitmap_weight(ifs->uptodate) * block_size.

This is an interesting idea but if we do this then I think this adds
some more edge cases. For example, the range being inlined or zeroed
may have some already uptodate blocks (eg from a prior buffered write)
so we'll need to calculate how many already-existing uptodate bytes
there are in that range to avoid over-decrementing
ifs->read_bytes_pending. I think we would also have to move the
ifs_alloc() and iomap_read_init() calls to the very beginning of
iomap_read_folio_iter() before any iomap_read_inline_data() call
because there could be the case where a folio has an ifs that was
allocated from a prior write, so if we call iomap_finish_folio_read()
after iomap_read_inline_data(), the folio's ifs->read_bytes_pending
now must be initialized before the inline read. Whereas before, we had
some more optimal behavior with being able to entirely skip the ifs
allocation and read initialization if the entire folio gets read
inline.

>
> > b) the async ->read_folio_range() callback finishes after the
> > zeroing's read_bytes_pending decrement and calls folio_end_read(), so
> > we need to assign ctx->cur_folio to NULL
>
> If we return 'finished' from iomap_finish_folio_read(), we can handle
> this?

I think there is still this scenario:
- ->read_folio gets called on an 8k-size folio for a 4k-size file
- iomap_read_init() is called, ifs->read_bytes_pending is now 8k
- make async ->read_folio_range() call to read in 4k
- iomap zeroes out folio from 4k to 8k, then calls
iomap_finish_folio_read() with off = 4k and len = 4k
- in iomap_finish_folio_read(), decrement ifs->read_bytes_pending by
len. ifs->read_bytes_pending is now 4k
- async ->read_folio_range() completes read, calls
iomap_finish_folio_read() with off=0 and len = 4k, which now
decrements ifs->read_bytes_pending by 4k. read_bytes_pending is now 0,
so folio_end_read() gets called. folio should now not be touched by
iomap
- iomap still has valid ctx->cur_folio, and calls iomap_read_end on
ctx->cur_folio

This is the same issue as the one in
https://lore.kernel.org/linux-fsdevel/20260126224107.2182262-2-joannelkoong@xxxxxxxxx/

We could always set ctx->cur_folio to NULL after inline/zeroing calls
iomap_finish_folio_read() regardless of whether it actually ended the
read or not, but then this runs into issues for zeroing. The zeroing
can be triggered by non-EOF cases, eg if the first mapping is an
IOMAP_HOLE and then the rest of hte folio is mapped. We may still need
to read in the rest of the folio, so we can't just set ctx->cur_folio
to NULL. i guess one workaround is to explicitly check if the zeroing
is for IOMAP_MAPPED types and if so then always set ctx->cur_folio to
NULL, but I think this just gets uglier / more complex to understand
and I'm not sure if there's other edge cases I'm missing that we'd
need to account for. One other idea is to try avoiding the
iomap_end_read() call for non-error cases if we use your
bitmap_weight() idea above, then it wouldn't matter in that scenario
above if ctx->cur_folio points to a folio that already had read ended
on it. But I think that also just makes the code harder to
read/understand.

The original patch seemed cleanest to me, maybe if we renamed uptodate
to mark_uptodate, it'd be more appetible? eg

@@ -80,18 +80,19 @@ static void iomap_set_range_uptodate(struct folio
*folio, size_t off,
{
struct iomap_folio_state *ifs = folio->private;
unsigned long flags;
- bool uptodate = true;
+ bool mark_uptodate = true;

if (folio_test_uptodate(folio))
return;

if (ifs) {
spin_lock_irqsave(&ifs->state_lock, flags);
- uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+ mark_uptodate = ifs_set_range_uptodate(folio, ifs, off, len) &&
+ !ifs->read_bytes_pending;
spin_unlock_irqrestore(&ifs->state_lock, flags);
}

- if (uptodate)
+ if (mark_uptodate)
folio_mark_uptodate(folio);
}

Thanks,
Joanne

>
> > I think the code would have to look something like [1] (this is
> > similar to the alternative approach I mentioned in my previous reply
> > but fixed up to cover some more edge cases).
> >
> > Thanks,
> > Joanne
> >
> > [1] https://github.com/joannekoong/linux/commit/b42f47726433a8130e8c27d1b43b16e27dfd6960
>
> I think we can do everything we need with a suitably modified
> iomap_finish_folio_read() rather than the new iomap_finish_read_range().