Re: [PATCH] btrfs: reject root with mismatched level between root_item and node header
From: ZhengYuan Huang
Date: Thu Mar 12 2026 - 22:50:05 EST
On Fri, Mar 13, 2026 at 5:29 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
> Nope, we have btrfs_tree_parent_check structure, which has all the
> needed checks at read time.
>
> The point of using that other than doing it manually here is, if one
> mirror is bad, but the other mirror is good, then we can still grab the
> good copy, but checking it here means if we got the bad mirror first, we
> have no more chance.
>
> And during read of root-node, we have already passed the proper level
> into it.
>
> So the only possibility is, your fuzzing tool is modifying the memory
> after the read check.
>
> If so, it's impossible to fix.
Thanks for the review and for pointing this out.
I agree that btrfs_tree_parent_check is the intended read-time
verifier, but the crash path here relies on a cache-hit bypass where
that verification is not re-run.
My earlier description may not have been clear enough, so let me walk
through the exact trigger path below.
Two different metadata blocks are involved:
- Block A: a root-tree leaf containing the root_item for tree 265
(the corrupted field: root_item.level = 1)
item 11 key (265 ROOT_ITEM 0) itemoff 13489 itemsize 439
generation 4255 root_dirid 256 bytenr 18787663872 level 1 refs 1
lastsnap 4214 byte_limit 0 bytes_used 16384 flags 0x0(none)
uuid 4cc4bc58-9708-2848-a264-19b95269f104
ctransid 13 otransid 13 stransid 0 rtransid 0
ctime 1766050670.362764444 (2025-12-18 09:37:50)
otime 1766050670.362000000 (2025-12-18 09:37:50)
drop key (0 UNKNOWN.0 0) level 0
- Block B: the actual tree-265 root block at bytenr 18787663872
(header.level = 0, otherwise valid)
item 57 key (18787663872 METADATA_ITEM 0) itemoff 14198 itemsize 33
refs 1 gen 4255 flags TREE_BLOCK
tree block skinny level 0
tree block backref root 265
In relocate_tree_blocks phase 1 (get_tree_block_key), block B is read
with check.level = block->level = 0 (from extent-tree metadata for
that extent item).
This I/O path runs btrfs_validate_extent_buffer, and the level check
passes (found 0, expected 0).
So block B becomes EXTENT_BUFFER_UPTODATE.
In phase 2 (build_backref_tree -> handle_indirect_tree_backref ->
btrfs_get_fs_root -> read_tree_root_path), level is taken from
root_item in block A via btrfs_root_level, so expected level becomes 1
(corrupt value).
Then read_tree_block is called for the same bytenr (block B), but now
it hits EXTENT_BUFFER_UPTODATE and returns from
read_extent_buffer_pages_nowait early.
On that cache-hit path, btrfs_validate_extent_buffer is not executed
again, so no level mismatch check occurs for expected=1 vs actual=0.
Because no read error is returned on cache hit, mirror retry logic is
never entered. So this is not a “bad mirror first, good mirror later”
case: there is no second mirror attempt because the read already
succeeded from cache.
read_tree_root_path then builds an inconsistent root object:
- root->root_item.level = 1 (from block A)
- root->node/commit_root level = 0 (from cached block B)
handle_indirect_tree_backref computes level = cur->level + 1 = 1,
searches commit_root (actual level 0), path->nodes[1] remains NULL,
and btrfs_node_blockptr(NULL, ...) crashes. So the issue is a
cross-block consistency gap at root construction time, not post-read
memory corruption by the fuzzer.
That is why the fix in read_tree_root_path (checking
btrfs_header_level(root->node) == btrfs_root_level(&root->root_item))
is needed even with btrfs_tree_parent_check in place.
To clarify, our fuzzing tool does not perform any in-memory
modification during testing. In fact, this bug is not caused by memory
corruption at all; it is triggered entirely by corrupted on-disk
metadata together with a cache-hit path that skips re-validation of
the root block. I have also uploaded the reproduction script to
https://drive.google.com/drive/folders/1BPXcgVI4DLzDcufNyynOakKD4EKnfVCg.
To reproduce the issue:
1. Build the PoC program: gcc repro.c -o poc
2. Build the ublk helper program from the ublk codebase, which serves
the crafted image and provides the runtime (on-disk) corruption
capability:
g++ -std=c++20 -fcoroutines -O2 -o standalone_replay \
standalone_replay_btrfs.cpp targets/ublksrv_tgt.cpp \
-I. -Iinclude -Itargets/include \
-L./lib/.libs -lublksrv -luring -lpthread
3. Attach the crafted image through ublk:
./standalone_replay add -t loop -f /path/to/image
4. Run the PoC: ./poc
This reliably reproduces the bug.
Thanks,
ZhengYuan Huang