Re: [REGRESSION] next boot regression caused by RIP: 0010:usercopy_abort+0x74/0x76 kernel panic

From: Laura Nao
Date: Tue Jul 16 2024 - 05:29:42 EST


Hello,

On 7/5/24 16:21, Laura Nao wrote:
> On 6/7/24 17:14, Laura Nao wrote:
>> Hello,
>>
>> KernelCI has detected a boot regression affecting all AMD and Intel
>> Chromebooks in the Collabora LAVA lab, occurring between next-20240605
>> and next-20240606.
>>
>> The following kernel panic has been reported in the logs. The trace
>> provided below is from an Acer Chromebook 317, with similar traces
>> observed on other devices:
>>
>> [ 5.944268] RIP: 0010:usercopy_abort+0x74/0x76
>> [ 5.944276] Code: 0f 89 9f 51 48 0f 45 d6 49 c7 c3 ac c1 7c 9f 4c 89 d1 57 48 c7 c6 38 54 7b 9f 48 c7 c7 b5 c1 7c 9f 49 0f 45 f3 e8 b9 8c e4 ff <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00
>> [ 5.944278] RSP: 0000:ffffb01e8001fb90 EFLAGS: 00010246
>> [ 5.944280] RAX: 0000000000000068 RBX: 0000000000000d80 RCX: 0000000000000000
>> [ 5.944281] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
>> [ 5.944282] RBP: 0000000000000000 R08: 0000000000000003 R09: 2079726f6d656d20
>> [ 5.944284] R10: 79706f6372657375 R11: 79706f6372657375 R12: ffff8e7b400a8800
>> [ 5.944285] R13: 0000000000000d80 R14: 0000000000000000 R15: 00000000ff879a40
>> [ 5.944286] FS: 0000000000000000(0003) GS:ffff8e7bc0100000(0063) knlGS:00000000eca4d440
>> [ 5.944288] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
>> [ 5.944289] CR2: 00000000080e61d0 CR3: 0000000107002000 CR4: 0000000000350ef0
>> [ 5.944290] Call Trace:
>> [ 5.944293] <TASK>
>> [ 5.944295] ? __die_body+0x1b/0x5d
>> [ 5.944300] ? die+0x31/0x4b
>> [ 5.944303] ? do_trap+0x7c/0xfe
>> [ 5.944306] ? usercopy_abort+0x74/0x76
>> [ 5.944309] ? usercopy_abort+0x74/0x76
>> [ 5.944312] ? do_error_trap+0x6f/0x99
>> [ 5.944315] ? usercopy_abort+0x74/0x76
>> [ 5.944318] ? exc_invalid_op+0x4e/0x65
>> [ 5.944321] ? usercopy_abort+0x74/0x76
>> [ 5.944324] ? asm_exc_invalid_op+0x16/0x20
>> [ 5.944327] ? usercopy_abort+0x74/0x76
>> [ 5.944330] __check_heap_object+0xcb/0x110
>> [ 5.944334] __check_object_size+0x181/0x26d
>> [ 5.944336] copy_from_buffer+0x43/0x66
>> [ 5.944340] copy_uabi_to_xstate+0x113/0x194
>> [ 5.944343] __fpu_restore_sig+0x3a3/0x4be
>> [ 5.944347] fpu__restore_sig+0x6c/0x83
>> [ 5.944350] ia32_restore_sigcontext+0x14e/0x16d
>> [ 5.944354] __do_compat_sys_sigreturn+0x7b/0xbc
>> [ 5.944357] do_int80_emulation+0xad/0xd3
>> [ 5.944360] ? handle_mm_fault+0x10e/0x199
>> [ 5.944363] ? exc_page_fault+0x27b/0x42f
>> [ 5.944365] ? fpregs_assert_state_consistent+0x22/0x47
>> [ 5.944368] ? clear_bhb_loop+0x45/0xa0
>> [ 5.944370] ? clear_bhb_loop+0x45/0xa0
>> [ 5.944372] ? clear_bhb_loop+0x45/0xa0
>> [ 5.944374] ? clear_bhb_loop+0x45/0xa0
>> [ 5.944375] ? clear_bhb_loop+0x45/0xa0
>> [ 5.944377] ? clear_bhb_loop+0xe/0xa0
>> [ 5.944379] asm_int80_emulation+0x16/0x20
>> [ 5.944382] RIP: 0023:0xeca52579
>> [ 5.944384] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
>> [ 5.944386] RSP: 002b:00000000ff879cbc EFLAGS: 00000246
>> [ 5.944387] RAX: 0000000000000060 RBX: 00000000ffffffff RCX: 00000000ff879d08
>> [ 5.944389] RDX: 0000000000000000 RSI: 0000000009b111a0 RDI: 00000000ff879d08
>> [ 5.944390] RBP: 00000000080d1801 R08: 0000000000000000 R09: 0000000000000000
>> [ 5.944391] R10: 0000000000000000 R11: 0000000000000282 R12: 0000000000000000
>> [ 5.944392] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>> [ 5.944393] </TASK>
>> [ 5.944394] Modules linked in:
>> [ 5.944433] ---[ end trace 0000000000000000 ]---
>> [ 6.287986] RIP: 0010:usercopy_abort+0x74/0x76
>> [ 6.293033] Code: 0f 89 9f 51 48 0f 45 d6 49 c7 c3 ac c1 7c 9f 4c 89 d1 57 48 c7 c6 38 54 7b 9f 48 c7 c7 b5 c1 7c 9f 49 0f 45 f3 e8 b9 8c e4 ff <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00
>> [ 6.313975] RSP: 0000:ffffb01e8001fb90 EFLAGS: 00010246
>> [ 6.319810] RAX: 0000000000000068 RBX: 0000000000000d80 RCX: 0000000000000000
>> [ 6.327780] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
>> [ 6.335744] RBP: 0000000000000000 R08: 0000000000000003 R09: 2079726f6d656d20
>> [ 6.343710] R10: 79706f6372657375 R11: 79706f6372657375 R12: ffff8e7b400a8800
>> [ 6.351678] R13: 0000000000000d80 R14: 0000000000000000 R15: 00000000ff879a40
>> [ 6.359646] FS: 0000000000000000(0003) GS:ffff8e7bc0000000(0063) knlGS:00000000eca4d440
>> [ 6.368680] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
>> [ 6.375098] CR2: 00000000f322e480 CR3: 0000000107002000 CR4: 0000000000350ef0
>> [ 6.383065] Kernel panic - not syncing: Fatal exception
>> [ 6.388907] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>
>> The full kernel log is available on [1]. The config used was the
>> upstream x86_64 defconfig with a fragment applied on top [2].
>>
>> The issue is still present on next-20240607.
>>
>> I'm sending this report to track the regression while a fix is
>> identified. I'll investigate the issue/run a bisection and report back
>> with the results.
>>
>
> Reverting this series fixes the issue first observed in next-20240606
> (CC Ingo):
> https://lore.kernel.org/all/20240605083557.2051480-1-mingo@xxxxxxxxxx/
>
> The issue is no longer present as of next-20240703, where the series was
> dropped. I'm marking this as resolved for now.
>

The issue has reappeared as of next-20240712, when the series landed
back in linux-next (see the full kernel log [1] and config [2] from a
next-20240715 run).

Ingo, do you have any pointers or suggestions on how we can further debug
this issue?

Thorsten, is there any way to mark this regression as unresolved again?

Thanks,

Laura Nao

[1] https://pastebin.com/raw/saEHbXgY
[2] https://pastebin.com/raw/aC1Kqi4Y

#regzbot introduced: 81106b7e0b