Re: [BUG] iomap: NULL dereference in iomap_finish_folio_read after deferred failed read
From: Joanne Koong
Date: Wed May 27 2026 - 00:13:25 EST
On Tue, May 26, 2026 at 8:25 PM Sam Sun <samsun1006219@xxxxxxxxx> wrote:
>
> On Mon, May 25, 2026 at 1:43 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> >
> > On Sat, May 23, 2026 at 06:58:46PM +0800, Sam Sun wrote:
> > > Dear developers and maintainers,
> > >
> > > We encountered a NULL pointer dereference in the deferred buffered
> > > read failure path. Unfortunately, no reproducer is available yet.
> >
> > [snip]
> >
> > > We analyzed the cause of this report. It seems that failed buffered
> > > read bios are deferred to a global work item "failed_read_work". The
> > > deferred worker still owns the bio/folio, but by the time it calls
> > > `iomap_finish_folio_read()`, the folio may have been truncated,
> > > invalidated, or otherwise detached from its mapping. In that case,
> > > `folio->mapping` is NULL, but the error reporting path unconditionally
> > > dereferences `folio->mapping->host`. This pattern still exists on the
> > > latest kernel commit.
> > >
> > > A possible fix could be avoiding the `folio->mapping->host`
> >
> > The folio is locked until the folio_end_read() call. So I don't see
> > what could legitimately clear folio->mapping here. Curiously waiting
> > for the reproducer to see waht is going wrong here.
> >
>
> I have more data now. I agree that the folio is supposed to stay locked
> until `folio_end_read()`, but that does not fully protect this path. The
> problematic window is a non-final read completion, after
> `ifs->read_bytes_pending` has been decremented and before the error reporting
> path dereferences `folio->mapping->host`. The relevant code is still
> present in v7.1-rc5.
>
> The race I observed is:
>
> 1. A failed buffered read completion reaches `iomap_finish_folio_read()`.
> 2. It decrements `ifs->read_bytes_pending`, but `finished` is false
> (`pending_after=2048` in the logs below), so this completion is not the one
> that unlocks the folio.
> 3. If the worker is delayed/preempted before the `fserror_report_io()` call,
> another completion can finish the remaining bytes and unlock the folio.
> 4. Userspace then truncates the XFS file. The truncate path removes the folio
> from the page cache and clears `folio->mapping` in
> `delete_from_page_cache_batch()`.
> 5. The delayed worker resumes and dereferences `folio->mapping->host`.
Does it not suffice to just call fserror_report_io() before
decrementing ifs->read_bytes_pending? eg
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d7b648421a70..d55b936e6986 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -400,6 +400,11 @@ void iomap_finish_folio_read(struct folio *folio,
size_t off, size_t len,
bool uptodate = !error;
bool finished = true;
+ if (error)
+ fserror_report_io(folio->mapping->host, FSERR_BUFFERED_READ,
+ folio_pos(folio) + off, len, error,
+ GFP_ATOMIC);
+
if (ifs) {
unsigned long flags;
@@ -411,11 +416,6 @@ void iomap_finish_folio_read(struct folio *folio,
size_t off, size_t len,
spin_unlock_irqrestore(&ifs->state_lock, flags);
}
- if (error)
- fserror_report_io(folio->mapping->host, FSERR_BUFFERED_READ,
- folio_pos(folio) + off, len, error,
- GFP_ATOMIC);
-
if (finished)
folio_end_read(folio, uptodate);
With read_bytes_pending still > 0, it's guaranteed that the folio is
locked and truncate can't invalidate folio->mapping yet as it requires
the folio lock.
Thanks,
Joanne
>
> In other words, the truncate path clears the mapping after the folio
> has legitimately
> been unlocked by another, final read completion. The unsafe part is that the
> non-final failed completion uses `folio->mapping->host` without holding a lock
> or reference that keeps `folio->mapping` stable.
>
> One possible interleaving is:
>
> ```
> CPU0: failed_read_work CPU1: another read completion CPU2: userspace truncate
> -------------------------------------
> ---------------------------------- --------------------------------
>
> iomap_finish_folio_read(folio, err)
>
> lock(ifs->state_lock)
> ifs->read_bytes_pending -= len
> finished = false
> unlock(ifs->state_lock)
>
> if (error)
> /* about to execute:
> * fserror_report_io(folio->mapping->host, ...)
> */
> <preempted here>
> iomap_finish_folio_read(folio, 0)
>
> lock(ifs->state_lock)
> ifs->read_bytes_pending -= len
> finished = true
> unlock(ifs->state_lock)
>
> folio_end_read(folio, true)
>
>
> open(..., O_TRUNC)
>
> do_truncate()
>
> truncate_setsize()
>
> truncate_inode_pages()
>
>
> lock_folio(folio)
>
> delete_from_page_cache_batch()
>
> folio->mapping = NULL
>
> unlock_folio(folio)
>
> <resumed>
> fserror_report_io(folio->mapping->host, ...)
> ^^^^^^^^^^^^^^
> NULL now
> ```
>
> To validate the race on v7.1-rc5, I used a diagnostic build that is
> equivalent to
> placing a breakpoint immediately before the original dereference.
> The diagnostic patch:
>
> - logs non-final failed completions and records the folio/mapping being used by
> the error path;
> - logs immediately before the existing `folio->mapping = NULL` assignments in
> `page_cache_delete()` and `delete_from_page_cache_batch()`;
> - defers only the non-final `fserror_report_io(folio->mapping->host, ...)` by
> one second so the truncate race has time to run.
>
> With the syz corpus on v7.1-rc5 plus the diagnostic breakpoint:
>
> ```text
> [ 116.142333] d280-iomap-latest: deferred non-final fserror
> folio=ffd40000013ea940 mapping=ff1100004560b6b8 pending_after=2048
> comm=kworker/0:7
> [ 116.177780] d280-filemap-latest: delete_from_page_cache_batch
> clearing watched folio=ffd40000013ea940 mapping=ff1100004560b6b8
> host=ff1100004560b4a0 ino=9286 index=0 locked=1 ref=4 comm=syz.5.30
> [ 117.330579] d280-iomap-latest: deferred fserror
> folio=ffd40000013ea940 mapping=0000000000000000 host=0000000000000000
> index=0 pos=0 off=0 len=2048 locked=0 ref=1 comm=kworker/0:9
> [ 117.336816] Oops: general protection fault, probably for
> non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
> [ 117.337746] KASAN: null-ptr-deref in range
> [0x0000000000000000-0x0000000000000007]
> [ 117.339635] Workqueue: events d280_deferred_report_work
> ```
>
> The detach is done by the normal truncate path:
>
> ```text
> delete_from_page_cache_batch()
> truncate_inode_pages_range()
> truncate_pagecache()
> xfs_vn_setattr_size()
> xfs_vn_setattr()
> notify_change()
> do_truncate()
> path_openat()
> do_file_open()
> __x64_sys_open()
> ```
>