Re: md raid6 oops in 6.6.4 stable

From: Genes Lists
Date: Thu Dec 07 2023 - 10:58:20 EST


On 12/7/23 09:42, Guoqing Jiang wrote:
Hi,

On 12/7/23 21:55, Genes Lists wrote:
On 12/7/23 08:30, Bagas Sanjaya wrote:
On Thu, Dec 07, 2023 at 08:10:04AM -0500, Genes Lists wrote:
I have not had a chance to git bisect this, but since it happened in stable I
thought it was important to share sooner rather than later.

One possibly relevant commit between 6.6.3 and 6.6.4 could be:

   commit 2c975b0b8b11f1ffb1ed538609e2c89d8abf800e
   Author: Song Liu <song@xxxxxxxxxx>
   Date:   Fri Nov 17 15:56:30 2023 -0800

     md: fix bi_status reporting in md_end_clone_io

The attached log shows a page_fault_oops.
The machine was up for 3 days before the crash happened.

Could you decode the oops (I can't find it in lore for some reason) ([1])? And
can it be reproduced reliably? If so, please share the steps to reproduce.

[1]. https://lwn.net/Articles/592724/

Thanks,
Guoqing

- reproducing
An rsync job runs twice a day, copying a (large) top-level directory from another machine to this server. On the 3rd day after booting 6.6.4, the second of these rsync runs triggered the oops. I need to do more testing to see whether I can reproduce it reliably. I have not seen this oops on earlier stable kernels.

- decoding the oops with scripts/decode_stacktrace.sh produced errors:
readelf: Error: Not an ELF file - it has the wrong magic bytes at the start

It appears that the decode script doesn't handle compressed modules. I changed the readelf line to decompress the module first. That resolves the readelf error above, and the decoded result is attached.
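For anyone hitting the same readelf complaint, here is a minimal sketch of how one might check whether a module file is a plain ELF object or a compressed one before handing it to readelf. It only inspects the leading magic bytes. The names classify_header and classify_file are made up for illustration, and which compression format the modules on this machine use is not stated above, so the sketch simply checks the three common ones.

```python
# Well-known file signatures found at offset 0 of a file.
MAGIC = {
    b"\x7fELF": "elf",            # uncompressed module, readelf can read it
    b"\x28\xb5\x2f\xfd": "zstd",  # e.g. foo.ko.zst
    b"\xfd7zXZ\x00": "xz",        # e.g. foo.ko.xz
    b"\x1f\x8b": "gzip",          # e.g. foo.ko.gz
}


def classify_header(header):
    """Return the format name for the given leading bytes, or 'unknown'."""
    for magic, kind in MAGIC.items():
        if header.startswith(magic):
            return kind
    return "unknown"


def classify_file(path):
    """Hypothetical helper: classify a file on disk by its first 8 bytes."""
    with open(path, "rb") as f:
        return classify_header(f.read(8))


if __name__ == "__main__":
    # Synthetic headers, not taken from real modules.
    print(classify_header(b"\x7fELF\x02\x01\x01\x00"))
    print(classify_header(b"\x28\xb5\x2f\xfd\x00\x00\x00\x00"))
```

Anything that does not classify as "elf" would need to be decompressed to a temporary file before readelf can parse it, which is the kind of change described above.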

gene





Dec 06 19:20:54 s6 kernel: BUG: unable to handle page fault for address: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: #PF: supervisor write access in kernel mode
Dec 06 19:20:54 s6 kernel: #PF: error_code(0x0003) - permissions violation
Dec 06 19:20:54 s6 kernel: PGD 336e01067 P4D 336e01067 PUD 1019ee063 PMD 1019f0063 PTE 8000000101931021
Dec 06 19:20:54 s6 kernel: Oops: 0003 [#1] PREEMPT SMP PTI
Dec 06 19:20:54 s6 kernel: CPU: 3 PID: 773 Comm: md127_raid6 Not tainted 6.6.4-stable-1 #4 784c1c710646cffc1e8cc5978f8f6cec974aa179
Dec 06 19:20:54 s6 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370 Extreme4, BIOS P4.20 10/31/2019
Dec 06 19:20:54 s6 kernel: RIP: update_io_ticks+0x2c/0x60
Dec 06 19:20:54 s6 kernel: Code: 1f 00 0f 1f 44 00 00 48 8b 4f 28 48 39 f1 78 17 80 7f 31 00 74 3b 48 8b 47 10 48 8b 78 40 48 8b 4f 28 48 39 f1 79 e9 48 89 c8 <f0> 48 0f b1 77 28 75 de 48 89 f0 48 29 c8 84 d2 b9 01 00 >
All code
========
0: 1f (bad)
1: 00 0f add %cl,(%rdi)
3: 1f (bad)
4: 44 00 00 add %r8b,(%rax)
7: 48 8b 4f 28 mov 0x28(%rdi),%rcx
b: 48 39 f1 cmp %rsi,%rcx
e: 78 17 js 0x27
10: 80 7f 31 00 cmpb $0x0,0x31(%rdi)
14: 74 3b je 0x51
16: 48 8b 47 10 mov 0x10(%rdi),%rax
1a: 48 8b 78 40 mov 0x40(%rax),%rdi
1e: 48 8b 4f 28 mov 0x28(%rdi),%rcx
22: 48 39 f1 cmp %rsi,%rcx
25: 79 e9 jns 0x10
27: 48 89 c8 mov %rcx,%rax
2a:* f0 48 0f b1 77 28 lock cmpxchg %rsi,0x28(%rdi) <-- trapping instruction
30: 75 de jne 0x10
32: 48 89 f0 mov %rsi,%rax
35: 48 29 c8 sub %rcx,%rax
38: 84 d2 test %dl,%dl
3a: b9 .byte 0xb9
3b: 01 00 add %eax,(%rax)
...

Code starting with the faulting instruction
===========================================
0: f0 48 0f b1 77 28 lock cmpxchg %rsi,0x28(%rdi)
6: 75 de jne 0xffffffffffffffe6
8: 48 89 f0 mov %rsi,%rax
b: 48 29 c8 sub %rcx,%rax
e: 84 d2 test %dl,%dl
10: b9 .byte 0xb9
11: 01 00 add %eax,(%rax)
...
Dec 06 19:20:54 s6 kernel: RSP: 0018:ffffc90000c0bb78 EFLAGS: 00010296
Dec 06 19:20:54 s6 kernel: RAX: cccccccccccccccc RBX: ffff8881019312c0 RCX: cccccccccccccccc
Dec 06 19:20:54 s6 kernel: RDX: 0000000000000001 RSI: 0000000110f28f4e RDI: ffff8881019312c0
Dec 06 19:20:54 s6 kernel: RBP: 0000000000000001 R08: ffff888104cc1760 R09: 0000000080200016
Dec 06 19:20:54 s6 kernel: R10: ffff88851f0ced00 R11: ffff8888beffb000 R12: 0000000000000008
Dec 06 19:20:54 s6 kernel: R13: 0000000000000028 R14: 0000000000000008 R15: 0000000000000048
Dec 06 19:20:54 s6 kernel: FS: 0000000000000000(0000) GS:ffff88889eec0000(0000) knlGS:0000000000000000
Dec 06 19:20:54 s6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8 CR3: 0000000336020002 CR4: 00000000003706e0
Dec 06 19:20:54 s6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 06 19:20:54 s6 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 06 19:20:54 s6 kernel: Call Trace:
Dec 06 19:20:54 s6 kernel: <TASK>
Dec 06 19:20:54 s6 kernel: ? __die+0x23/0x70
Dec 06 19:20:54 s6 kernel: ? page_fault_oops+0x171/0x4e0
Dec 06 19:20:54 s6 kernel: ? exc_page_fault+0x175/0x180
Dec 06 19:20:54 s6 kernel: ? asm_exc_page_fault+0x26/0x30
Dec 06 19:20:54 s6 kernel: ? update_io_ticks+0x2c/0x60
Dec 06 19:20:54 s6 kernel: bdev_end_io_acct+0x63/0x160
Dec 06 19:20:54 s6 kernel: md_end_clone_io+0x75/0xa0 md_mod
Dec 06 19:20:54 s6 kernel: handle_stripe_clean_event+0x1ee/0x430 raid456
Dec 06 19:20:54 s6 kernel: handle_stripe+0x7b6/0x1ac0 raid456
Dec 06 19:20:54 s6 kernel: handle_active_stripes.isra.0+0x38d/0x550 raid456
Dec 06 19:20:54 s6 kernel: raid5d+0x488/0x750 raid456
Dec 06 19:20:54 s6 kernel: ? lock_timer_base+0x61/0x80
Dec 06 19:20:54 s6 kernel: ? prepare_to_wait_event+0x60/0x180
Dec 06 19:20:54 s6 kernel: ? __pfx_md_thread+0x10/0x10 md_mod
Dec 06 19:20:54 s6 kernel: md_thread+0xab/0x190 md_mod
Dec 06 19:20:54 s6 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Dec 06 19:20:54 s6 kernel: kthread+0xe5/0x120
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10
Dec 06 19:20:54 s6 kernel: ret_from_fork+0x31/0x50
Dec 06 19:20:54 s6 kernel: ? __pfx_kthread+0x10/0x10
Dec 06 19:20:54 s6 kernel: ret_from_fork_asm+0x1b/0x30
Dec 06 19:20:54 s6 kernel: </TASK>
Dec 06 19:20:54 s6 kernel: Modules linked in: algif_hash af_alg mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache netfs nft_ct>
Dec 06 19:20:54 s6 kernel: snd_hda_codec kvm snd_hda_core drm_buddy snd_hwdep iTCO_wdt i2c_algo_bit mei_pxp intel_pmc_bxt snd_pcm mei_hdcp ee1004 irqbypass ttm iTCO_vendor_support rapl drm_display_helper nls_iso8859_1>
Dec 06 19:20:54 s6 kernel: CR2: ffff8881019312e8
Dec 06 19:20:54 s6 kernel: ---[ end trace 0000000000000000 ]---
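
As a reading aid for the log above (not a diagnosis), the following sketch decodes three values from it using the architectural x86-64 bit layouts: the page-fault error code 0x0003, the faulting address computed from RDI plus the 0x28 displacement of the trapping cmpxchg, and the flag bits of the PTE value. The function names are made up for illustration, and only a subset of the PTE flag bits is listed.

```python
# x86 page-fault error code: (bit, meaning if set, meaning if clear).
PF_BITS = [
    (0, "protection violation (page present)", "page not present"),
    (1, "write access", "read access"),
    (2, "user mode", "supervisor mode"),
]

# Subset of x86-64 page-table-entry flag bits.
PTE_FLAGS = {0: "P", 1: "RW", 2: "US", 5: "A", 6: "D", 63: "NX"}


def decode_pf_error(code):
    """Describe the low three bits of a page-fault error code."""
    return [on if (code >> bit) & 1 else off for bit, on, off in PF_BITS]


def decode_pte_flags(pte):
    """List the names of the known flag bits that are set in a PTE."""
    return [name for bit, name in PTE_FLAGS.items() if (pte >> bit) & 1]


if __name__ == "__main__":
    # Values copied from the oops above.
    print(decode_pf_error(0x0003))
    rdi = 0xFFFF8881019312C0
    print(hex(rdi + 0x28))  # target of: lock cmpxchg %rsi,0x28(%rdi)
    print(decode_pte_flags(0x8000000101931021))
```

With the values from the log, the error code decodes to a supervisor-mode write to a present page, RDI + 0x28 equals the CR2 address ffff8881019312e8, and the PTE has P, A and NX set but not RW. That is consistent with the "permissions violation" line: the write hit a page that was mapped read-only at the time of the fault.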