Re: [BUG] iomap: NULL dereference in iomap_finish_folio_read after deferred failed read

From: Sam Sun

Date: Fri May 29 2026 - 06:26:31 EST

On Wed, May 27, 2026 at 12:13 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>
> On Tue, May 26, 2026 at 8:25 PM Sam Sun <samsun1006219@xxxxxxxxx> wrote:
> >
> > On Mon, May 25, 2026 at 1:43 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > >
> > > On Sat, May 23, 2026 at 06:58:46PM +0800, Sam Sun wrote:
> > > > Dear developers and maintainers,
> > > >
> > > > We encountered a NULL pointer dereference in the deferred buffered
> > > > read failure path. Unfortunately, no reproducer is available yet.
> > >
> > > [snip]
> > >
> > > > We analyzed the cause of this report. It seems that failed buffered
> > > > read bios are deferred to a global work item "failed_read_work". The
> > > > deferred worker still owns the bio/folio, but by the time it calls
> > > > `iomap_finish_folio_read()`, the folio may have been truncated,
> > > > invalidated, or otherwise detached from its mapping. In that case,
> > > > `folio->mapping` is NULL, but the error reporting path unconditionally
> > > > dereferences `folio->mapping->host`. This pattern still exists on the
> > > > latest kernel commit.
> > > >
> > > > A possible fix could be avoiding the `folio->mapping->host`
> > >
> > > The folio is locked until the folio_end_read() call. So I don't see
> > > what could legitimately clear folio->mapping here. Curiously waiting
> > > for the reproducer to see waht is going wrong here.
> > >
> >
> > I have more data now. I agree that the folio is supposed to stay locked
> > until `folio_end_read()`, but that does not fully protect this path. The
> > problematic window is a non-final read completion, after
> > `ifs->read_bytes_pending` has been decremented and before the error reporting
> > path dereferences `folio->mapping->host`. The relevant code is still
> > present in v7.1-rc5.
> >
> > The race I observed is:
> >
> > 1. A failed buffered read completion reaches `iomap_finish_folio_read()`.
> > 2. It decrements `ifs->read_bytes_pending`, but `finished` is false
> > (`pending_after=2048` in the logs below), so this completion is not the one
> > that unlocks the folio.
> > 3. If the worker is delayed/preempted before the `fserror_report_io()` call,
> > another completion can finish the remaining bytes and unlock the folio.
> > 4. Userspace then truncates the XFS file. The truncate path removes the folio
> > from the page cache and clears `folio->mapping` in
> > `delete_from_page_cache_batch()`.
> > 5. The delayed worker resumes and dereferences `folio->mapping->host`.
>
> Does it not suffice to just call fserror_report_io() before
> decrementing ifs->read_bytes_pending? eg
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d7b648421a70..d55b936e6986 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -400,6 +400,11 @@ void iomap_finish_folio_read(struct folio *folio,
> size_t off, size_t len,
> bool uptodate = !error;
> bool finished = true;
>
> + if (error)
> + fserror_report_io(folio->mapping->host, FSERR_BUFFERED_READ,
> + folio_pos(folio) + off, len, error,
> + GFP_ATOMIC);
> +
> if (ifs) {
> unsigned long flags;
>
> @@ -411,11 +416,6 @@ void iomap_finish_folio_read(struct folio *folio,
> size_t off, size_t len,
> spin_unlock_irqrestore(&ifs->state_lock, flags);
> }
>
> - if (error)
> - fserror_report_io(folio->mapping->host, FSERR_BUFFERED_READ,
> - folio_pos(folio) + off, len, error,
> - GFP_ATOMIC);
> -
> if (finished)
> folio_end_read(folio, uptodate);
>
>
> With read_bytes_pending still > 0, it's guaranteed that the folio is
> locked and truncate can't invalidate folio->mapping yet as it requires
> the folio lock.
>
> Thanks,
> Joanne
>

Yes, that makes sense. For the race I observed, moving
fserror_report_io() before the read_bytes_pending decrement should
keep the failing completion's range counted while folio->mapping->host
is used. That means another completion should not be able to make
read_bytes_pending reach zero and unlock the folio before the error
report runs, so truncate/invalidation should not be able to clear
folio->mapping in that window.

So this looks like it addresses the interleaving I reproduced. Thanks
for pointing this out.

Best,
Yue

> >
> > In other words, the truncate path clears the mapping after the folio
> > has legitimately
> > been unlocked by another, final read completion. The unsafe part is that the
> > non-final failed completion uses `folio->mapping->host` without holding a lock
> > or reference that keeps `folio->mapping` stable.
> >
> > One possible interleaving is:
> >
> > ```
> > CPU0: failed_read_work CPU1: another read completion CPU2: userspace truncate
> > -------------------------------------
> > ---------------------------------- --------------------------------
> >
> > iomap_finish_folio_read(folio, err)
> >
> > lock(ifs->state_lock)
> > ifs->read_bytes_pending -= len
> > finished = false
> > unlock(ifs->state_lock)
> >
> > if (error)
> > /* about to execute:
> > * fserror_report_io(folio->mapping->host, ...)
> > */
> > <preempted here>
> > iomap_finish_folio_read(folio, 0)
> >
> > lock(ifs->state_lock)
> > ifs->read_bytes_pending -= len
> > finished = true
> > unlock(ifs->state_lock)
> >
> > folio_end_read(folio, true)
> >
> >
> > open(..., O_TRUNC)
> >
> > do_truncate()
> >
> > truncate_setsize()
> >
> > truncate_inode_pages()
> >
> >
> > lock_folio(folio)
> >
> > delete_from_page_cache_batch()
> >
> > folio->mapping = NULL
> >
> > unlock_folio(folio)
> >
> > <resumed>
> > fserror_report_io(folio->mapping->host, ...)
> > ^^^^^^^^^^^^^^
> > NULL now
> > ```
> >
> > To validate the race on v7.1-rc5, I used a diagnostic build that is
> > equivalent to
> > placing a breakpoint immediately before the original dereference.
> > The diagnostic patch:
> >
> > - logs non-final failed completions and records the folio/mapping being used by
> > the error path;
> > - logs immediately before the existing `folio->mapping = NULL` assignments in
> > `page_cache_delete()` and `delete_from_page_cache_batch()`;
> > - defers only the non-final `fserror_report_io(folio->mapping->host, ...)` by
> > one second so the truncate race has time to run.
> >
> > With the syz corpus on v7.1-rc5 plus the diagnostic breakpoint:
> >
> > ```text
> > [ 116.142333] d280-iomap-latest: deferred non-final fserror
> > folio=ffd40000013ea940 mapping=ff1100004560b6b8 pending_after=2048
> > comm=kworker/0:7
> > [ 116.177780] d280-filemap-latest: delete_from_page_cache_batch
> > clearing watched folio=ffd40000013ea940 mapping=ff1100004560b6b8
> > host=ff1100004560b4a0 ino=9286 index=0 locked=1 ref=4 comm=syz.5.30
> > [ 117.330579] d280-iomap-latest: deferred fserror
> > folio=ffd40000013ea940 mapping=0000000000000000 host=0000000000000000
> > index=0 pos=0 off=0 len=2048 locked=0 ref=1 comm=kworker/0:9
> > [ 117.336816] Oops: general protection fault, probably for
> > non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
> > [ 117.337746] KASAN: null-ptr-deref in range
> > [0x0000000000000000-0x0000000000000007]
> > [ 117.339635] Workqueue: events d280_deferred_report_work
> > ```
> >
> > The detach is done by the normal truncate path:
> >
> > ```text
> > delete_from_page_cache_batch()
> > truncate_inode_pages_range()
> > truncate_pagecache()
> > xfs_vn_setattr_size()
> > xfs_vn_setattr()
> > notify_change()
> > do_truncate()
> > path_openat()
> > do_file_open()
> > __x64_sys_open()
> > ```
> >