Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read

From: Matthew Wilcox

Date: Thu Feb 12 2026 - 14:31:56 EST

On Wed, Feb 11, 2026 at 03:13:48PM -0800, Joanne Koong wrote:
> On Wed, Feb 11, 2026 at 1:03 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Wed, Feb 11, 2026 at 11:33:05AM -0800, Joanne Koong wrote:
> > > ifs->read_bytes_pending gets initialized to the folio size, but if the
> > > file being read in is smaller than the size of the folio, then we
> > > reach this scenario because the file has been read in but
> > > ifs->read_bytes_pending is still a positive value because it
> > > represents the bytes between the end of the file and the end of the
> > > folio. If the folio size is 16k and the file size is 4k:
> > > a) ifs->read_bytes_pending gets initialized to 16k
> > > b) ->read_folio_range() is called for the 4k read
> > > c) the 4k read succeeds, ifs->read_bytes_pending is now 12k and the
> > > 0 to 4k range is marked uptodate
> > > d) the post-eof blocks are zeroed and marked uptodate in the call to
> > > iomap_set_range_uptodate()
> >
> > This is the bug then. If they're marked uptodate, read_bytes_pending
> > should be decremented at the same time. Now, I appreciate that
> > iomap_set_range_uptodate() is called both from iomap_read_folio_iter()
> > and __iomap_write_begin(), and it can't decrement read_bytes_pending
> > in the latter case. Perhaps a flag or a second length parameter is
> > the solution?
>
> I don't think it's enough to decrement read_bytes_pending by the
> zeroed/read-inline length because there's these two edge cases:
> a) some blocks in the folio were already uptodate from the very
> beginning and skipped for IO but not decremented yet from
> ifs->read_bytes_pending, which means in iomap_read_end(),
> ifs->read_bytes_pending would be > 0 and the uptodate flag could get
> XORed again. This means we need to also decrement read_bytes_pending
> by bytes_submitted as well for this case

Hm, that's a good one. It can't happen for readahead, but it can happen
if we start out by writing to some blocks of a folio, then call
read_folio to get the remaining blocks uptodate. We could avoid it
happening by initialising read_bytes_pending to folio_size() -
bitmap_weight(ifs->uptodate) * block_size.

> b) the async ->read_folio_range() callback finishes after the
> zeroing's read_bytes_pending decrement and calls folio_end_read(), so
> we need to assign ctx->cur_folio to NULL

If we return 'finished' from iomap_finish_folio_read(), we can handle
this?

> I think the code would have to look something like [1] (this is
> similar to the alternative approach I mentioned in my previous reply
> but fixed up to cover some more edge cases).
>
> Thanks,
> Joanne
>
> [1] https://github.com/joannekoong/linux/commit/b42f47726433a8130e8c27d1b43b16e27dfd6960

I think we can do everything we need with a suitably modified
iomap_finish_folio_read() rather than the new iomap_finish_read_range().