Re: [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error
From: Qu Wenruo
Date: Sun Mar 29 2026 - 03:05:41 EST
On 2026/3/29 17:01, Teng Liu wrote:
When open_ctree() fails during btrfs_read_chunk_tree(), readahead BIOs
submitted by readahead_tree_node_children() may still be in flight. The
error path frees fs_info without waiting for these BIOs to complete.
When a readahead BIO later completes, btrfs_simple_end_io() calls
btrfs_bio_counter_dec() which accesses the already-freed
fs_info->dev_replace.bio_counter, causing a use-after-free.
This can be triggered by connecting a USB drive with a corrupted btrfs
filesystem (e.g. chunk tree destroyed by a partial format), where the
slow USB device keeps readahead BIOs in flight long enough for the
error path to free fs_info before they complete. It can be reproduced
in QEMU with a suitably corrupted btrfs image.
BTRFS error (device sda): failed to read chunk tree: -2
BTRFS error (device sda): open_ctree failed: -2
BUG: unable to handle page fault for address: ffff89322ceb3000
RIP: 0010:percpu_counter_add_batch+0xe/0xb0
btrfs_bio_counter_sub+0x22/0x60
btrfs_simple_end_io+0x32/0x90
blk_update_request+0x12b/0x480
scsi_end_request+0x26/0x1b0
scsi_io_completion+0x50/0x790
Fix this by waiting for the bio_counter to reach zero in the error path
before stopping workers, so all in-flight BIOs have completed their
callbacks before fs_info is freed. The bio_counter is already
initialized in init_mount_fs_info() so this wait is safe for all error
paths reaching the fail_sb_buffer label.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221270
Reported-by: AHN SEOK-YOUNG
Signed-off-by: Teng Liu <27rabbitlt@xxxxxxxxx>
---
fs/btrfs/disk-io.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 01f2dbb69..61e6b8dca 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3723,6 +3723,18 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
fail_sb_buffer:
+ /*
+ * Wait for in-flight readahead BIOs before stopping workers.
+ * Readahead BIOs from btrfs_read_chunk_tree() (via
+ * readahead_tree_node_children) may still be in flight on slow
+ * devices (e.g. USB). Their completion callbacks
+ * (btrfs_simple_end_io) access fs_info->dev_replace.bio_counter
+ * which would be destroyed later, causing a use-after-free.
+ * The bio_counter was already initialized in init_mount_fs_info()
+ * so this wait is safe for all error paths reaching this label.
+ */
+ wait_event(fs_info->dev_replace.replace_wait,
+ percpu_counter_sum(&fs_info->dev_replace.bio_counter) == 0);
This doesn't make any sense to me.
The wait queue and the counter are both for dev-replace, which does not match your description of generic metadata readahead.
If you want to wait for all in-flight metadata reads, I couldn't find a good helper for that, so you will need to go through all extent buffers and wait for the EXTENT_BUFFER_READING flag to clear on each of them.
btrfs_stop_all_workers(fs_info);
btrfs_free_block_groups(fs_info);
fail_alloc:
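
For illustration only, the "go through all extent buffers and wait for EXTENT_BUFFER_READING" idea from the reply can be modeled in user space roughly as below. This is a sketch, not btrfs code: struct model_buffer, struct model_fs and every model_* function are hypothetical names invented here, a pthread condition variable stands in for the kernel's wait mechanism, and a plain array stands in for however the extent buffers would actually be enumerated.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an extent buffer: only the read-in-flight state. */
struct model_buffer {
	bool reading;	/* models the EXTENT_BUFFER_READING flag */
};

/* Hypothetical stand-in for fs_info: owns the buffers and a wait queue. */
struct model_fs {
	pthread_mutex_t lock;
	pthread_cond_t read_done;
	struct model_buffer *buffers;
	size_t nr_buffers;
};

static void model_fs_init(struct model_fs *fs, struct model_buffer *bufs,
			  size_t nr)
{
	pthread_mutex_init(&fs->lock, NULL);
	pthread_cond_init(&fs->read_done, NULL);
	fs->buffers = bufs;
	fs->nr_buffers = nr;
}

/* Completion side: clear the flag for one buffer and wake any waiter. */
static void model_end_read(struct model_fs *fs, size_t idx)
{
	assert(idx < fs->nr_buffers);
	pthread_mutex_lock(&fs->lock);
	fs->buffers[idx].reading = false;
	pthread_cond_broadcast(&fs->read_done);
	pthread_mutex_unlock(&fs->lock);
}

/* Thread entry that plays the role of a slow device finishing every read. */
static void *model_complete_all(void *arg)
{
	struct model_fs *fs = arg;

	for (size_t i = 0; i < fs->nr_buffers; i++)
		model_end_read(fs, i);
	return NULL;
}

/*
 * Error-path side: walk every buffer and block until its read has finished.
 * Only after this returns is it safe to free what the completions touch.
 */
static void model_wait_all_reads(struct model_fs *fs)
{
	pthread_mutex_lock(&fs->lock);
	for (size_t i = 0; i < fs->nr_buffers; i++) {
		while (fs->buffers[i].reading)
			pthread_cond_wait(&fs->read_done, &fs->lock);
	}
	pthread_mutex_unlock(&fs->lock);
}
```

In this model the error path would call model_wait_all_reads() before freeing struct model_fs, so no completion can run against freed memory. The point is that the wait is keyed on per-buffer read state rather than on a counter that belongs to an unrelated feature.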