Re: [PATCH] btrfs: reject root with mismatched level between root_item and node header
From: Qu Wenruo
Date: Thu Mar 12 2026 - 23:10:21 EST
On 2026/3/13 13:19, ZhengYuan Huang wrote:
On Fri, Mar 13, 2026 at 5:29 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
Nope, we have btrfs_tree_parent_check structure, which has all the
needed checks at read time.
The point of using that rather than doing it manually here is: if one
mirror is bad but the other mirror is good, we can still grab the
good copy, whereas checking it here means that if we got the bad mirror
first, we have no further chance.
And during read of root-node, we have already passed the proper level
into it.
So the only possibility is that your fuzzing tool is modifying the
memory after the read check.
If so, it's impossible to fix.
Thanks for the review and for pointing this out.
I agree that btrfs_tree_parent_check is the intended read-time
verifier, but the crash path here relies on a cache-hit bypass where
that verification is not re-run.
My earlier description may have been misleading, or at least not clear
enough, so let me clarify the exact trigger path in more detail below.
My bad, I forgot to mention the correct way to fix it: you should put the check into the cache-hit path, rather than scattering ad-hoc checks around.
The correct way is to add an optional @check parameter to btrfs_buffer_uptodate() so that cached extent buffers are still checked.
Two different metadata blocks are involved:
- Block A: a root-tree leaf containing root_item for tree 265 (this
field is corrupted: root_item.level = 1)
item 11 key (265 ROOT_ITEM 0) itemoff 13489 itemsize 439
generation 4255 root_dirid 256 bytenr 18787663872 level 1 refs 1
lastsnap 4214 byte_limit 0 bytes_used 16384 flags 0x0(none)
uuid 4cc4bc58-9708-2848-a264-19b95269f104
ctransid 13 otransid 13 stransid 0 rtransid 0
ctime 1766050670.362764444 (2025-12-18 09:37:50)
otime 1766050670.362000000 (2025-12-18 09:37:50)
drop key (0 UNKNOWN.0 0) level 0
- Block B: the actual tree-265 root block at bytenr 18787663872
(header.level = 0, otherwise valid)
item 57 key (18787663872 METADATA_ITEM 0) itemoff 14198 itemsize 33
refs 1 gen 4255 flags TREE_BLOCK
tree block skinny level 0
tree block backref root 265
In relocate_tree_blocks phase 1 (get_tree_block_key), block B is read
with check.level = block->level = 0 (from extent-tree metadata for
that extent item).
This I/O path runs btrfs_validate_extent_buffer, and the level check
passes (found 0, expected 0).
So block B becomes EXTENT_BUFFER_UPTODATE.
In phase 2 (build_backref_tree -> handle_indirect_tree_backref ->
btrfs_get_fs_root -> read_tree_root_path), level is taken from
root_item in block A via btrfs_root_level, so expected level becomes 1
(corrupt value).
Then read_tree_block is called for the same bytenr (block B), but now
it hits EXTENT_BUFFER_UPTODATE and returns from
read_extent_buffer_pages_nowait early.
On that cache-hit path, btrfs_validate_extent_buffer is not executed
again, so no level mismatch check occurs for expected=1 vs actual=0.
Because no read error is returned on cache hit, mirror retry logic is
never entered. So this is not a “bad mirror first, good mirror later”
case: there is no second mirror attempt because the read already
succeeded from cache.
read_tree_root_path then builds an inconsistent root object:
- root->root_item.level = 1 (from block A)
- root->node/commit_root level = 0 (from cached block B)
handle_indirect_tree_backref computes level = cur->level + 1 = 1,
searches commit_root (actual level 0), path->nodes[1] remains NULL,
and btrfs_node_blockptr(NULL, ...) crashes. So the issue is a
cross-block consistency gap at root construction time, not post-read
memory corruption by the fuzzer.
That is why the fix in read_tree_root_path (checking
btrfs_header_level(root->node) == btrfs_root_level(&root->root_item))
is needed even with btrfs_tree_parent_check in place.
To clarify, our fuzzing tool does not perform any in-memory
modification during testing. In fact, this bug is not caused by memory
corruption at all; it is triggered entirely by corrupted on-disk
metadata together with a cache-hit path that skips re-validation of
the root block. I have also uploaded the reproduction script to
https://drive.google.com/drive/folders/1BPXcgVI4DLzDcufNyynOakKD4EKnfVCg.
To reproduce the issue:
1. Build the PoC program: gcc repro.c -o poc
2. Build the ublk helper program from the ublk codebase, which is
used to provide the runtime corruption capability:
g++ -std=c++20 -fcoroutines -O2 -o standalone_replay \
standalone_replay_btrfs.cpp targets/ublksrv_tgt.cpp \
-I. -Iinclude -Itargets/include \
-L./lib/.libs -lublksrv -luring -lpthread
3. Attach the crafted image through ublk:
./standalone_replay add -t loop -f /path/to/image
4. Run the PoC: ./poc
This reliably reproduces the bug.
Thanks,
ZhengYuan Huang