Re: [PATCH v2 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list

From: Ojaswin Mujoo
Date: Tue Nov 29 2022 - 09:38:16 EST


On Mon, Nov 28, 2022 at 10:06:34PM -0500, Theodore Ts'o wrote:
> This commit (determined via bisection) seems to be causing a reliable
> failure using the ext4/ext3 configuration when running generic/269:
>
> % kvm-xfstests -c ext4/ext3 generic/269
> ...
> BEGIN TEST ext3 (1 test): Ext4 4k block emulating ext3 Mon Nov 28 21:39:35 EST 2022
> DEVICE: /dev/vdd
> EXT_MKFS_OPTIONS: -O ^extents,^flex_bg,^uninit_bg,^64bit,^metadata_csum,^huge_file,^die
> EXT_MOUNT_OPTIONS: -o block_validity,nodelalloc
> FSTYP -- ext4
> PLATFORM -- Linux/x86_64 kvm-xfstests 6.1.0-rc4-xfstests-00018-g1c85d4890f15 #8492
> MKFS_OPTIONS -- -F -q -O ^extents,^flex_bg,^uninit_bg,^64bit,^metadata_csum,^huge_filc
> MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity,nodelalloc /dev/vdc /vdc
>
> generic/269 23s ... [21:39:35][ 3.085973] run fstests generic/269 at 2022-11-28 215
> [ 14.931680] ------------[ cut here ]------------
> [ 14.931902] kernel BUG at fs/ext4/mballoc.c:4025!
> [ 14.932137] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 14.932366] CPU: 1 PID: 2677 Comm: fsstress Not tainted 6.1.0-rc4-xfstests-00018-g19
> [ 14.932756] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-4
> [ 14.933169] RIP: 0010:ext4_mb_pa_adjust_overlap.constprop.0+0x18e/0x1c0
> [ 14.933457] Code: 66 54 8b 48 54 89 4c 24 04 e8 ae 92 9f 00 41 8b 46 40 85 c0 75 bc4
> [ 14.934270] RSP: 0018:ffffc90003aeb868 EFLAGS: 00010283
> [ 14.934499] RAX: 0000000000000000 RBX: 00000000000000fc RCX: 0000000000000000
> [ 14.934830] RDX: 0000000000000001 RSI: ffffc90003aeb8d4 RDI: 0000000000000001
> [ 14.935146] RBP: 0000000000000200 R08: 0000000000008000 R09: 0000000000000001
> [ 14.935447] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000103
> [ 14.935744] R13: 0000000000000101 R14: ffff8880073370e0 R15: ffff888007337118
> [ 14.936043] FS: 00007f94eda0b740(0000) GS:ffff88807dd00000(0000) knlGS:000000000000
> [ 14.936390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 14.936634] CR2: 000055ba905a0448 CR3: 000000001092c005 CR4: 0000000000770ee0
> [ 14.936932] PKRU: 55555554
> [ 14.937048] Call Trace:
> [ 14.937190] <TASK>
> [ 14.937285] ext4_mb_normalize_request.constprop.0+0x1e9/0x440
> [ 14.937534] ext4_mb_new_blocks+0x3a2/0x560
> [ 14.937715] ext4_alloc_branch+0x21e/0x350
> [ 14.937892] ext4_ind_map_blocks+0x322/0x750
> [ 14.938076] ext4_map_blocks+0x380/0x6e0
> [ 14.938260] _ext4_get_block+0xb2/0x120
> [ 14.938426] ext4_block_write_begin+0x13c/0x3d0
> [ 14.938624] ? _ext4_get_block+0x120/0x120
> [ 14.938801] ext4_write_begin+0x1c1/0x570
> [ 14.938973] generic_perform_write+0xcf/0x220
> [ 14.939175] ext4_buffered_write_iter+0x84/0x140
> [ 14.939377] do_iter_readv_writev+0xf0/0x150
> [ 14.939562] do_iter_write+0x80/0x150
> [ 14.939722] vfs_writev+0xed/0x1f0
> [ 14.939871] do_writev+0x73/0x100
> [ 14.940016] do_syscall_64+0x37/0x90
> [ 14.940186] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [ 14.940403] RIP: 0033:0x7f94edb02da3
> [ 14.940559] Code: 8b 15 f1 90 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f8
> [ 14.941341] RSP: 002b:00007ffc5e82d0d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
> [ 14.941659] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f94edb02da3
> [ 14.941961] RDX: 0000000000000356 RSI: 000055ba901c1240 RDI: 0000000000000003
> [ 14.942290] RBP: 0000000000000003 R08: 000055ba901cf240 R09: 00007f94edbccbe0
> [ 14.942596] R10: 0000000000000080 R11: 0000000000000246 R12: 000000000000062a
> [ 14.942902] R13: 0000000000000356 R14: 000055ba901c1240 R15: 000000000000b529
> [ 14.943219] </TASK>
> [ 14.943326] ---[ end trace 0000000000000000 ]---
>
> Looking at the stack trace it looks like we're hitting this BUG_ON:
>
> spin_lock(&tmp_pa->pa_lock);
> if (tmp_pa->pa_deleted == 0)
> BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
> spin_unlock(&tmp_pa->pa_lock);
>
> ... in the inline function ext4_mb_pa_assert_overlap(), called from
> ext4_mb_pa_adjust_overlap().
>
> The generic/269 test runs fsstress with an ENOSPC hitter as an
> antagonist process. The ext3 configuration disables delayed
> allocation, which means that fsstress is going to be allocating blocks
> at write time (instead of dirty page writeback time).
>
> Could you take a look? Thanks!

Hi Ted,

Thanks for pointing this out, I'll have a look into this.
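
For reference, the BUG_ON quoted above asserts that the normalized
request range and a live PA's range do not overlap. A minimal userspace
sketch of that predicate follows. ranges_disjoint() is a hypothetical
name used only for illustration, not a function in fs/ext4/mballoc.c,
and the half-open [start, end) convention is an assumption read off the
quoted comparison.

```c
#include <stdbool.h>

/*
 * Hypothetical helper for illustration only (not kernel code).
 *
 * Returns true when the half-open logical block ranges [start, end)
 * and [pa_start, pa_end) have no block in common. This is the same
 * expression that the quoted BUG_ON() negates.
 */
static bool ranges_disjoint(unsigned int start, unsigned int end,
			    unsigned int pa_start, unsigned int pa_end)
{
	/* Request lies entirely after the PA, or entirely before it. */
	return start >= pa_end || end <= pa_start;
}
```

Under that reading, the quoted check is equivalent to
BUG_ON(!ranges_disjoint(start, end, tmp_pa_start, tmp_pa_end)) for any
PA with pa_deleted == 0, so the crash means a non-deleted PA still
overlapped the request after the adjustment step.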

PS: I'm on vacation so might be a bit slow to update on this.

Regards,
Ojaswin
>
> - Ted