Re: [PATCH v2 0/5] btrfs: zoned: remove extent_buffer redirtying

From: Naohiro Aota
Date: Mon Feb 19 2024 - 01:04:15 EST


On Thu, Nov 23, 2023 at 07:47:14AM -0800, Johannes Thumshirn wrote:
> Since the beginning of zoned mode, I've promised Josef to get rid of the
> extent_buffer redirtying, but never actually got around to doing so.
>
> Then 2 weeks ago our CI has hit an ASSERT() in this area and I started to look
> into it again. After some discussion with Christoph we came to the conclusion
> to finally take the time and get rid of the extent_buffer redirtying once and
> for all.
>
> Patch one renames EXTENT_BUFFER_NO_CHECK into EXTENT_BUFFER_ZONED_ZEROOUT,
> because this fits the new model somewhat better.
>
> Number two sets the cancel bit instead of clearing the dirty bit from a zoned
> extent_buffer.
>
> Number three removes the last remaining bits of btrfs_redirty_list_add().
>
> The last two patches in this series are just trivial cleanups I came across
> while looking at the code.
>
> ---

Hello,

[SUMMARY]

With this series applied, running generic/013 200 times hits a
"__btrfs_free_extent:3251: errno=-117 Filesystem corrupted" error. I
investigated the issue and found a workaround patch. But, I'm still not
sure what is going on in the background. So, that patch might not address
the root cause.

[How it fails]

It fails with the following backtrace and error prints. It failed because
it cannot find a corresponding METADATA_ITEM for a delayed reference item.

It is interesting that the ref is trying to find "byte nr 66948939776
parent 66948939776 root 5 owner 0", but only finds "(66948939776 169 1)",
with the same bytenr and the same root, but different level. And, it fails
like this all the time. It sees the same bytenr but different level (1 !=
0).

Also, note that at this point, the extent buffer had no contents (filled
with zero).

Feb 17 17:25:43 kernel: BTRFS info (device nvme1n2): host-managed zoned block device /dev/nvme1n2, 1844 zones of 2147483648 bytes
Feb 17 17:25:43 kernel: BTRFS info (device nvme1n2): zoned: async discard ignored and disabled for zoned mode
Feb 17 17:25:43 kernel: BTRFS info (device nvme1n2): zoned mode enabled with zone size 2147483648
Feb 17 17:25:45 kernel: ------------[ cut here ]------------
Feb 17 17:25:45 kernel: WARNING: CPU: 9 PID: 29834 at fs/btrfs/extent-tree.c:3248 __btrfs_free_extent.isra.0+0x604/0x1330 [btrfs]
Feb 17 17:25:45 kernel: Modules linked in: dm_flakey algif_hash af_alg xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter ip_tables br_netfilter bridge stp llc overlay sunrpc kvm_amd kvm irqbypass rapl acpi_cpufreq ipmi_ssif i2c_piix4 k10temp btrfs ipmi_si blake2b_generic ipmi_devintf xor ipmi_msghandler raid6_pq bfq loop dm_mod zram bnxt_en ccp pkcs8_key_parser asn1_decoder public_key oid_registry fuse msr ipv6
Feb 17 17:25:45 kernel: CPU: 9 PID: 29834 Comm: fsstress Not tainted 6.8.0-rc4-BTRFS-ZNS+ #403
Feb 17 17:25:45 kernel: Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
Feb 17 17:25:45 kernel: RIP: 0010:__btrfs_free_extent.isra.0+0x604/0x1330 [btrfs]
Feb 17 17:25:45 kernel: Code: 8b 3f e8 bf 69 00 00 48 8b 7d 60 45 8b 4f 40 49 89 d8 8b 54 24 40 4c 89 e9 48 c7 c6 30 64 65 a0 e8 61 fb 0d 00 e9 8f fd ff ff <0f> 0b f0 48 0f ba 28 02 41 b8 00 00 00 00 0f 83 86 04 00 00 b9 8b
Feb 17 17:25:45 kernel: RSP: 0018:ffffc900090cfb80 EFLAGS: 00010246
Feb 17 17:25:45 kernel: RAX: ffff888365c719d8 RBX: 0000000f9677c000 RCX: 0000000000000001
Feb 17 17:25:45 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
Feb 17 17:25:45 kernel: RBP: ffff8889a044b220 R08: 0000000000000000 R09: 0000000000000004
Feb 17 17:25:45 kernel: R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000001
Feb 17 17:25:45 kernel: R13: ffff888ad87a4c98 R14: 0000000000000005 R15: ffff888a0c7d2a80
Feb 17 17:25:45 kernel: FS: 00007f823f5f7740(0000) GS:ffff889fcea40000(0000) knlGS:0000000000000000
Feb 17 17:25:45 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 17 17:25:45 kernel: CR2: 0000560ce0610b38 CR3: 0000000a907ec000 CR4: 0000000000350ef0
Feb 17 17:25:45 kernel: Call Trace:
Feb 17 17:25:45 kernel: <TASK>
Feb 17 17:25:46 kernel: ? __btrfs_free_extent.isra.0+0x604/0x1330 [btrfs]
Feb 17 17:25:46 kernel: ? __warn+0x81/0x170
Feb 17 17:25:46 kernel: ? __btrfs_free_extent.isra.0+0x604/0x1330 [btrfs]
Feb 17 17:25:46 kernel: ? report_bug+0x18d/0x1c0
Feb 17 17:25:46 kernel: ? handle_bug+0x3c/0x70
Feb 17 17:25:46 kernel: ? exc_invalid_op+0x13/0x60
Feb 17 17:25:46 kernel: ? asm_exc_invalid_op+0x16/0x20
Feb 17 17:25:46 kernel: ? __btrfs_free_extent.isra.0+0x604/0x1330 [btrfs]
Feb 17 17:25:46 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:46 kernel: ? rcu_is_watching+0xd/0x40
Feb 17 17:25:46 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:46 kernel: ? lock_release+0x1e5/0x280
Feb 17 17:25:46 kernel: __btrfs_run_delayed_refs+0x64c/0x1380 [btrfs]
Feb 17 17:25:46 kernel: ? btrfs_commit_transaction+0x3e/0x12d0 [btrfs]
Feb 17 17:25:46 kernel: btrfs_run_delayed_refs+0x92/0x130 [btrfs]
Feb 17 17:25:46 kernel: btrfs_commit_transaction+0xa2/0x12d0 [btrfs]
Feb 17 17:25:46 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:46 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:46 kernel: ? rcu_is_watching+0xd/0x40
Feb 17 17:25:46 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:46 kernel: ? lock_release+0x1e5/0x280
Feb 17 17:25:46 kernel: btrfs_sync_file+0x532/0x660 [btrfs]
Feb 17 17:25:46 kernel: __x64_sys_fsync+0x37/0x60
Feb 17 17:25:46 kernel: do_syscall_64+0x79/0x1a0
Feb 17 17:25:46 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
Feb 17 17:25:46 kernel: RIP: 0033:0x7f823f6f8400
Feb 17 17:25:46 kernel: Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d e1 d1 0d 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
Feb 17 17:25:46 kernel: RSP: 002b:00007ffe3c26e9f8 EFLAGS: 00000202 ORIG_RAX: 000000000000004a
Feb 17 17:25:46 kernel: RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f823f6f8400
Feb 17 17:25:46 kernel: RDX: 0000000000000193 RSI: 0000560cdfcdb048 RDI: 0000000000000003
Feb 17 17:25:46 kernel: RBP: 00000000000002e6 R08: 0000000000000007 R09: 00007ffe3c26ea0c
Feb 17 17:25:46 kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe3c26ea10
Feb 17 17:25:46 kernel: R13: 028f5c28f5c28f5c R14: 8f5c28f5c28f5c29 R15: 0000560cdfcd7180
Feb 17 17:25:46 kernel: </TASK>
Feb 17 17:25:46 kernel: irq event stamp: 0
Feb 17 17:25:46 kernel: hardirqs last enabled at (0): [<0000000000000000>] 0x0
Feb 17 17:25:46 kernel: hardirqs last disabled at (0): [<ffffffff810e5e0e>] copy_process+0xb0e/0x1e00
Feb 17 17:25:46 kernel: softirqs last enabled at (0): [<ffffffff810e5e0e>] copy_process+0xb0e/0x1e00
Feb 17 17:25:46 kernel: softirqs last disabled at (0): [<0000000000000000>] 0x0
Feb 17 17:25:46 kernel: ---[ end trace 0000000000000000 ]---
Feb 17 17:25:46 kernel: ------------[ cut here ]------------
Feb 17 17:25:46 kernel: BTRFS: Transaction aborted (error -117)
Feb 17 17:25:46 kernel: WARNING: CPU: 9 PID: 29834 at fs/btrfs/extent-tree.c:3249 __btrfs_free_extent.isra.0+0xf8e/0x1330 [btrfs]
Feb 17 17:25:46 kernel: Modules linked in: dm_flakey algif_hash af_alg xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter ip_tables br_netfilter bridge stp llc overlay sunrpc kvm_amd kvm irqbypass rapl acpi_cpufreq ipmi_ssif i2c_piix4 k10temp btrfs ipmi_si blake2b_generic ipmi_devintf xor ipmi_msghandler raid6_pq bfq loop dm_mod zram bnxt_en ccp pkcs8_key_parser asn1_decoder public_key oid_registry fuse msr ipv6
Feb 17 17:25:46 kernel: CPU: 9 PID: 29834 Comm: fsstress Tainted: G W 6.8.0-rc4-BTRFS-ZNS+ #403
Feb 17 17:25:46 kernel: Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
Feb 17 17:25:46 kernel: RIP: 0010:__btrfs_free_extent.isra.0+0xf8e/0x1330 [btrfs]
Feb 17 17:25:46 kernel: Code: 48 c7 c6 40 5d 65 a0 e8 f0 f1 0d 00 c7 44 24 18 01 00 00 00 e9 ed f7 ff ff be 8b ff ff ff 48 c7 c7 68 5d 65 a0 e8 52 69 c1 e0 <0f> 0b e9 30 fb ff ff 48 8b 45 60 48 05 d8 19 00 00 f0 48 0f ba 28
Feb 17 17:25:46 kernel: RSP: 0018:ffffc900090cfb80 EFLAGS: 00010282
Feb 17 17:25:46 kernel: RAX: 0000000000000000 RBX: 0000000f9677c000 RCX: 0000000000000000
Feb 17 17:25:46 kernel: RDX: 0000000000000002 RSI: ffffffff82464302 RDI: 00000000ffffffff
Feb 17 17:25:46 kernel: RBP: ffff8889a044b220 R08: 0000000000009ffb R09: 00000000ffffdfff
Feb 17 17:25:46 kernel: R10: 00000000ffffdfff R11: ffffffff8264dd80 R12: 0000000000000001
Feb 17 17:25:46 kernel: R13: ffff888ad87a4c98 R14: 0000000000000005 R15: ffff888a0c7d2a80
Feb 17 17:25:46 kernel: FS: 00007f823f5f7740(0000) GS:ffff889fcea40000(0000) knlGS:0000000000000000
Feb 17 17:25:46 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 17 17:25:46 kernel: CR2: 0000560ce0610b38 CR3: 0000000a907ec000 CR4: 0000000000350ef0
Feb 17 17:25:46 kernel: Call Trace:
Feb 17 17:25:46 kernel: <TASK>
Feb 17 17:25:46 kernel: ? __btrfs_free_extent.isra.0+0xf8e/0x1330 [btrfs]
Feb 17 17:25:46 kernel: ? __warn+0x81/0x170
Feb 17 17:25:46 kernel: ? __btrfs_free_extent.isra.0+0xf8e/0x1330 [btrfs]
Feb 17 17:25:46 kernel: ? report_bug+0x18d/0x1c0
Feb 17 17:25:47 kernel: ? tick_nohz_tick_stopped+0x12/0x30
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? handle_bug+0x3c/0x70
Feb 17 17:25:47 kernel: ? exc_invalid_op+0x13/0x60
Feb 17 17:25:47 kernel: ? asm_exc_invalid_op+0x16/0x20
Feb 17 17:25:47 kernel: ? __btrfs_free_extent.isra.0+0xf8e/0x1330 [btrfs]
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? rcu_is_watching+0xd/0x40
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? lock_release+0x1e5/0x280
Feb 17 17:25:47 kernel: __btrfs_run_delayed_refs+0x64c/0x1380 [btrfs]
Feb 17 17:25:47 kernel: ? btrfs_commit_transaction+0x3e/0x12d0 [btrfs]
Feb 17 17:25:47 kernel: btrfs_run_delayed_refs+0x92/0x130 [btrfs]
Feb 17 17:25:47 kernel: btrfs_commit_transaction+0xa2/0x12d0 [btrfs]
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? rcu_is_watching+0xd/0x40
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? lock_release+0x1e5/0x280
Feb 17 17:25:47 kernel: btrfs_sync_file+0x532/0x660 [btrfs]
Feb 17 17:25:47 kernel: __x64_sys_fsync+0x37/0x60
Feb 17 17:25:47 kernel: do_syscall_64+0x79/0x1a0
Feb 17 17:25:47 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
Feb 17 17:25:47 kernel: RIP: 0033:0x7f823f6f8400
Feb 17 17:25:47 kernel: Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d e1 d1 0d 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
Feb 17 17:25:47 kernel: RSP: 002b:00007ffe3c26e9f8 EFLAGS: 00000202 ORIG_RAX: 000000000000004a
Feb 17 17:25:47 kernel: RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f823f6f8400
Feb 17 17:25:47 kernel: RDX: 0000000000000193 RSI: 0000560cdfcdb048 RDI: 0000000000000003
Feb 17 17:25:47 kernel: RBP: 00000000000002e6 R08: 0000000000000007 R09: 00007ffe3c26ea0c
Feb 17 17:25:47 kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe3c26ea10
Feb 17 17:25:47 kernel: R13: 028f5c28f5c28f5c R14: 8f5c28f5c28f5c29 R15: 0000560cdfcd7180
Feb 17 17:25:47 kernel: </TASK>
Feb 17 17:25:47 kernel: irq event stamp: 0
Feb 17 17:25:47 kernel: hardirqs last enabled at (0): [<0000000000000000>] 0x0
Feb 17 17:25:47 kernel: hardirqs last disabled at (0): [<ffffffff810e5e0e>] copy_process+0xb0e/0x1e00
Feb 17 17:25:47 kernel: softirqs last enabled at (0): [<ffffffff810e5e0e>] copy_process+0xb0e/0x1e00
Feb 17 17:25:47 kernel: softirqs last disabled at (0): [<0000000000000000>] 0x0
Feb 17 17:25:47 kernel: ---[ end trace 0000000000000000 ]---
Feb 17 17:25:47 kernel: BTRFS: error (device nvme1n2: state A) in __btrfs_free_extent:3249: errno=-117 Filesystem corrupted
Feb 17 17:25:47 kernel: BTRFS info (device nvme1n2: state EA): forced readonly
Feb 17 17:25:47 kernel: BTRFS info (device nvme1n2: state EA): leaf 66957836288 gen 3873 total ptrs 203 free space 1102 owner 2
Feb 17 17:25:47 kernel: BTRFS info (device nvme1n2: state EA): refs 2 lock_owner 29834 current 29834
Feb 17 17:25:47 kernel: item 0 key (63394947072 168 40960) itemoff 16230 itemsize 53
Feb 17 17:25:47 kernel: extent refs 1 gen 3835 flags 1
Feb 17 17:25:47 kernel: ref#0: extent data backref root 5 objectid 552 offset 1802240 count 1
(snip)...
Feb 17 17:25:55 kernel: item 164 key (66948923392 169 0) itemoff 8229 itemsize 33
Feb 17 17:25:55 kernel: extent refs 1 gen 3872 flags 2
Feb 17 17:25:55 kernel: ref#0: tree block backref root 2
Feb 17 17:25:55 kernel: item 165 key (66948939776 169 1) itemoff 8196 itemsize 33
Feb 17 17:25:55 kernel: extent refs 1 gen 3873 flags 2
Feb 17 17:25:55 kernel: ref#0: tree block backref root 5
Feb 17 17:25:55 kernel: item 166 key (68719476736 168 110592) itemoff 8143 itemsize 53
Feb 17 17:25:55 kernel: extent refs 1 gen 3841 flags 1
Feb 17 17:25:55 kernel: ref#0: extent data backref root 5 objectid 440 offset 3100672 count 1
(snip)...
Feb 17 17:25:56 kernel: item 202 key (68722249728 168 110592) itemoff 6177 itemsize 53
Feb 17 17:25:56 kernel: extent refs 1 gen 3842 flags 1
Feb 17 17:25:56 kernel: ref#0: extent data backref root 5 objectid 953 offset 5431296 count 1
Feb 17 17:25:56 kernel: BTRFS critical (device nvme1n2: state EA): unable to find ref byte nr 66948939776 parent 66948939776 root 5 owner 0 offset 0 slot 166
Feb 17 17:25:57 kernel: BTRFS error (device nvme1n2: state EA): failed to run delayed ref for logical 66948939776 num_bytes 16384 type 182 action 2 ref_mod 1: -2
Feb 17 17:25:57 kernel: BTRFS: error (device nvme1n2: state EA) in btrfs_run_delayed_refs:2246: errno=-2 No such entry
Feb 17 17:25:57 kernel: BTRFS warning (device nvme1n2: state EA): Skipping commit of aborted transaction.
Feb 17 17:25:57 kernel: BTRFS: error (device nvme1n2: state EA) in cleanup_transaction:2006: errno=-2 No such entry
Feb 17 17:26:02 kernel: BTRFS info (device nvme1n2: state EA): last unmount of filesystem f33f6f52-ef71-4984-898d-ff20390559d1

[Workaround patch]

I added "level" field to struct extent_buffer to store the tree level and
checked if buf->level is the same as btrfs_header_level(buf) or not in
btrfs_free_tree_block(). It revealed buf->level != btrfs_header_level(buf)
can happen. It is because the extent buffer is already zeroed out in
btree_csum_one_bio(). Since the header is also cleared,
btrfs_header_level(buf) returns 0 while it actually should be 1. As a
result, a faulty delayed ref, whose level = 0 not 1, is inserted.

A workaround fix is easy. We can leave the header intact. With this patch,
the failure goes away and generic/013 passed 300 times.

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8ab185182c30..f43104224c92 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -278,7 +278,9 @@ blk_status_t btree_csum_one_bio(struct btrfs_bio *bbio)
* ordering of I/O without unnecessarily writing out data.
*/
if (test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags)) {
- memzero_extent_buffer(eb, 0, eb->len);
+ const unsigned long header_size = sizeof(struct btrfs_header);
+
+ memzero_extent_buffer(eb, header_size, eb->len - header_size);
return BLK_STS_OK;
}

[Remaining Questions]

Timeline of the faulty referenced extent buffer should be following:

- (allocated)
- btrfs_clear_buffer_dirty: set EXTENT_BUFFER_ZONED_ZEROOUT
- btree_csum_one_bio: zero out entire buffer
- (written to device?)
- btrfs_free_tree_block: add DROP_DELAYED_REF

Q1. I previously assumed at btrfs_clear_buffer_dirty, there is no reference
left. But, the actual drop of the reference is done after
btrfs_free_tree_block. Is it too early to mark it EXTENT_BUFFER_ZONED_ZEROOUT and
allow it to be zero out? Maybe, if the FS crashed just after writing zeroed
buffer to a device, re-mounted FS will see zero filled node and panic?

Q2. With the "buf->level != btrfs_header_level(buf)" check, I observed some
level mismatch printed with no delayed ref error followed. How can it
happen? I guess it is merged with the existing delayed refs ... but it does
not care the level mismatch?

Q3. My experience on debugging this suggests this bug is highly timing
related: adding debug prints in btrfs_free_tree_block prevents the bug to
appear. What is the reason behind this?