Re: kernel BUG in __clear_extent_bit

From: Qu Wenruo
Date: Wed Sep 22 2021 - 22:31:42 EST




On 2021/9/23 10:24, Hao Sun wrote:
Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote on Wed, Sep 15, 2021 at 1:33 PM:



On 2021/9/15 10:20 AM, Hao Sun wrote:
Hello,

When using Healer to fuzz the latest Linux kernel, the following crash
was triggered.

HEAD commit: 6880fa6c5660 Linux 5.15-rc1
git tree: upstream
console output:
https://drive.google.com/file/d/1-9wwV6-OmBcJvHGCbMbP5_uCVvrUdTp3/view?usp=sharing
kernel config: https://drive.google.com/file/d/1rUzyMbe5vcs6khA3tL9EHTLJvsUdWcgB/view?usp=sharing
C reproducer: https://drive.google.com/file/d/1eXePTqMQ5ZA0TWtgpTX50Ez4q9ZKm_HE/view?usp=sharing
Syzlang reproducer:
https://drive.google.com/file/d/11s13louoKZ7Uz0mdywM2jmE9B1JEIt8U/view?usp=sharing

If you fix this issue, please add the following tag to the commit:
Reported-by: Hao Sun <sunhao.th@xxxxxxxxx>

loop1: detected capacity change from 0 to 32768
BTRFS info (device loop1): disk space caching is enabled
BTRFS info (device loop1): has skinny extents
BTRFS info (device loop1): enabling ssd optimizations
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 0
CPU: 1 PID: 25852 Comm: syz-executor Not tainted 5.15.0-rc1 #16
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Call Trace:
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
fail_dump lib/fault-inject.c:52 [inline]
should_fail+0x13c/0x160 lib/fault-inject.c:146
should_failslab+0x5/0x10 mm/slab_common.c:1328
slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494
slab_alloc_node mm/slub.c:3120 [inline]
slab_alloc mm/slub.c:3214 [inline]
kmem_cache_alloc+0x44/0x280 mm/slub.c:3219
alloc_extent_state+0x1e/0x1c0 fs/btrfs/extent_io.c:340

This is one of the core systems btrfs uses, and we really don't want
it to fail.

For that reason, btrfs does some preallocation to prevent failure.

But in the error injection case, we can still hit the BUG_ON() that is
used to catch ENOMEM.
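
To illustrate the point above, here is a minimal userspace sketch of a
"preallocate, then fall back" pattern. It is only loosely modelled on the
idea described here and is not the btrfs code: struct state, alloc_state(),
take_prealloc(), take_prealloc_or_die() and the inject_failure switch are
all hypothetical names, and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an extent state record; not the kernel struct. */
struct state {
    unsigned long start;
    unsigned long end;
};

/* Hypothetical fault-injection switch: when true, every allocation fails. */
static bool inject_failure;

/* Allocation that may fail, standing in for a slab allocation. */
static struct state *alloc_state(void)
{
    if (inject_failure)
        return NULL;
    return calloc(1, sizeof(struct state));
}

/*
 * Hand out the preallocated object if there is one; otherwise fall back
 * to a fresh allocation. The fallback is the part that injected failures
 * can still reach once the preallocated object has been consumed.
 */
static struct state *take_prealloc(struct state **prealloc)
{
    struct state *s = *prealloc;

    *prealloc = NULL;
    if (!s)
        s = alloc_state();
    return s;
}

/* Mimics the BUG_ON(!state) style: abort instead of returning an error. */
static struct state *take_prealloc_or_die(struct state **prealloc)
{
    struct state *s = take_prealloc(prealloc);

    assert(s != NULL);
    return s;
}
```

With one object preallocated, the first take_prealloc() call succeeds even
while failures are injected; a second call has nothing left to hand out and
returns NULL, which is the case take_prealloc_or_die() turns into a crash.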


Hello,

The fuzzer triggered the following crashes repeatedly when `fault
injection` was enabled.

HEAD commit: 92477dd1faa6 Merge tag 's390-5.15-ebpf-jit-fixes'
git tree: upstream
kernel config: https://drive.google.com/file/d/1KgvcM8i_3hQiOL3fUh3JFpYNQM4itvV4/view?usp=sharing
[1] kernel BUG in btrfs_free_tree_block (fs/btrfs/extent-tree.c:3297):
https://paste.ubuntu.com/p/ZtzVKWbcGm/
[2] kernel BUG in clear_state_bit (fs/btrfs/extent_io.c:658):
https://paste.ubuntu.com/p/hps2wXPG2b/
[3] kernel BUG in set_extent_bit (fs/btrfs/extent_io.c:1021):
https://paste.ubuntu.com/p/dcptjYYxgd/
[4] kernel BUG in set_state_bits (fs/btrfs/extent_io.c:939):
https://paste.ubuntu.com/p/NV9qtKB4KZ/

All the above crashes were triggered directly by the `BUG_ON()` macro
at the corresponding location.
Most of the `BUG_ON()` checks were hit because of `ENOMEM` under fault
injection.
Would it be better for btrfs to handle the `ENOMEM` error, e.g.,
return gracefully, rather than panic the kernel?

The __clear_extent_bit() part is one of the critical sections, and we
really rely on it to work correctly.

We even implemented a preallocation scheme to prevent such problems, but
preallocation cannot completely cover the failures that error injection
produces.

We do need to improve the error handling, but that would be a
pretty big project, not something that can be easily done right now.
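
As a rough idea of what "handling the error" means at a single call site,
here is a userspace sketch that returns -ENOMEM and leaves the existing
record untouched instead of asserting. struct range, alloc_range(),
split_range() and inject_failure are hypothetical names for illustration
only. The hard part in btrfs is that every caller up the chain would need
to cope with the new error, which this sketch does not attempt to show.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical inclusive byte range [start, end]. */
struct range {
    unsigned long start;
    unsigned long end;
};

/* Hypothetical fault-injection switch: when true, every allocation fails. */
static bool inject_failure;

static struct range *alloc_range(void)
{
    if (inject_failure)
        return NULL;
    return calloc(1, sizeof(struct range));
}

/*
 * Split orig at offset `at`: orig becomes [start, at - 1] and the new
 * record returned through *tail covers [at, end].
 * Returns 0 on success or -ENOMEM, in which case orig is not modified.
 */
static int split_range(struct range *orig, unsigned long at,
                       struct range **tail)
{
    struct range *t;

    /* Precondition on the arguments, not an allocation check. */
    assert(at > orig->start && at <= orig->end);

    t = alloc_range();
    if (!t)
        return -ENOMEM; /* a BUG_ON() style check would crash here */

    t->start = at;
    t->end = orig->end;
    orig->end = at - 1;
    *tail = t;
    return 0;
}
```

The allocation happens before any field is modified, so a failed call
leaves the caller with a consistent range it can retry or abort on.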

Thanks,
Qu

Regards
Hao