[RFC PATCH] fs/buffer: serialize set_buffer_uptodate against concurrent clears

From: Chao Shi

Date: Sat Apr 25 2026 - 22:02:00 EST

A WARN_ON_ONCE(!buffer_uptodate(bh)) in mark_buffer_dirty() is reachable
from the buffered write path on a block device when the underlying
device returns I/O errors at high density. Reproduced by fuzzing an
NVMe controller (FEMU) that returns crafted error completions for a
sustained workload from /dev/nvme0n1.

The race is:

CPU A: block_commit_write (folio lock held) CPU B: end_buffer_async_read
set_buffer_uptodate(bh);
clear_buffer_uptodate(bh);
mark_buffer_dirty(bh); /* WARN fires */

The contract documented at set_buffer_uptodate() in
include/linux/buffer_head.h:140 already states:

"Any other serialization (with IO errors or whatever that might
clear the bit) has to come from other state (eg BH_Lock)."

block_commit_write() and the buffer_new() branch in
__block_write_begin_int() violate this contract: they hold the folio
lock but not BH_Lock when calling set_buffer_uptodate() immediately
followed by mark_buffer_dirty(). Take BH_Lock around the pair so the
documented serialization holds.

The race is the same family as 558d6450c775 ("ext4: fix
WARN_ON_ONCE(!buffer_uptodate) after an error writing the
superblock"), which addressed the ext4 superblock-specific case via
state recovery. No equivalent recovery hook exists in the generic
block_commit_write() path, so apply BH_Lock instead.

WARN stack:

RIP: mark_buffer_dirty+0x4c2/0x560 fs/buffer.c:1183
Call Trace:
block_commit_write fs/buffer.c
block_write_end fs/buffer.c
iomap_write_end fs/iomap/buffered-io.c
iomap_file_buffered_write
blkdev_buffered_write block/fops.c
blkdev_write_iter
vfs_write
__x64_sys_write

Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).

Acked-by: Sungwoo Kim <iam@xxxxxxxxxxxx>
Acked-by: Dave Tian <daveti@xxxxxxxxxx>
Acked-by: Weidong Zhu <weizhu@xxxxxxx>
Signed-off-by: Chao Shi <coshi036@xxxxxxxxx>
---

Notes for reviewers (RFC):

1. lock_buffer() in the buffered-write end path: in the steady state
the bh should be unlocked when block_commit_write() runs
(block_write_begin already waited for any RMW read), so contention
should be rare. fio numbers TBD; happy to defer until measured if
that is the bar.

2. Several other call sites have the same set_buffer_uptodate(bh)
immediately followed by mark_buffer_dirty(bh) pattern without
BH_Lock:

fs/nilfs2/mdt.c:60
fs/ocfs2/alloc.c:6840
fs/ocfs2/aops.c:655
fs/exfat/fatent.c:408
fs/exfat/misc.c:168
fs/ufs/ialloc.c:146
fs/ufs/inode.c:1076, 1088
fs/ufs/balloc.c:324
fs/jfs/super.c:770
fs/ext2/super.c:1594
fs/ntfs3/fsntfs.c:1096, 1491
fs/ntfs3/bitmap.c:755, 796, 1387

Most look like one-shot init / metadata paths where concurrent IO
completion on the same bh is unlikely, but I have not audited each.
Per-fs follow-ups by respective maintainers, or one tree-wide
series?

3. Reproducer is currently fuzzer-only (FEMU + syz-executor). A
minimal C reproducer using dm-flakey for read-error injection is
in progress.

fs/buffer.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4d7f84e77d2..bc4fad93392 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2041,9 +2041,16 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
if (buffer_new(bh)) {
clean_bdev_bh_alias(bh);
if (folio_test_uptodate(folio)) {
+ /*
+ * See block_commit_write() for why we
+ * must hold BH_Lock around set_uptodate
+ * + mark_dirty.
+ */
+ lock_buffer(bh);
clear_buffer_new(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
+ unlock_buffer(bh);
continue;
}
if (block_end > to || block_start < from)
@@ -2104,8 +2111,20 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
if (!buffer_uptodate(bh))
partial = true;
} else {
+ /*
+ * Per the contract documented at set_buffer_uptodate()
+ * (include/linux/buffer_head.h), callers must hold
+ * BH_Lock to serialize against concurrent clears of
+ * BH_Uptodate. Holding only the folio lock is not
+ * sufficient: a concurrent end_buffer_async_read() on
+ * a previously failed read can clear BH_Uptodate
+ * between set_buffer_uptodate() and mark_buffer_dirty(),
+ * tripping the WARN_ON_ONCE in mark_buffer_dirty().
+ */
+ lock_buffer(bh);
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
+ unlock_buffer(bh);
}
if (buffer_new(bh))
clear_buffer_new(bh);
--
2.43.0