Re: watchdog: BUG: soft lockup in note_gp_changes in kernel/rcu/tree.c

From: Kun Hu
Date: Mon Jan 06 2025 - 01:38:25 EST




> On Jan 3, 2025, at 08:16, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Thu, Jan 02, 2025 at 10:59:27AM +0800, Kun Hu wrote:
>> Hello,
>>
>> When using our custom fuzzer tool to fuzz the latest Linux kernel, the following crash
>> was triggered.
>>
>> HEAD commit: dbfac60febfa806abb2d384cb6441e77335d2799
>> git tree: upstream
>> Console output: https://drive.google.com/file/d/1D3EDxDxPi0t7m_Z4Uc4FuL26DnHs7yTa/view?usp=sharing
>> Kernel config: https://drive.google.com/file/d/1m1mk_YusR-tyusNHFuRbzdj8KUzhkeHC/view?usp=sharing
>> C reproducer: /
>> Syzlang reproducer: /
>>
>> We observed a crash at line 1333 in note_gp_changes, likely caused by a race condition involving rcu_gp_kthread_wake and note_gp_changes. The issue appears to involve insufficient or incorrect synchronization, as indicated by the involvement of _raw_spin_unlock_irqrestore in spinlock.c. Specifically, this may lead to invalid accesses to rcu_state.gp_kthread or related flags (e.g., gp_flags), potentially resulting in unexpected behavior in swake_up_one_online.
>>
>> Could you please help check if this needs to be addressed?
>
> This is a new one on me.
>
> This is running in a guest OS. Might the underlying hypervisor be
> overloaded? That could result in vCPU preemption and thus in this sort
> of soft lockup.
>
> Also, when I check out the above commit (which is v6.13-rc4), I find that
> line 1333 is the close curly brace of note_gp_changes(). Of course, it is
> possible that the address-to-symbol translation failed (please check!),
> but in the absence of such failure, there is no way that I know of that
> incorrect synchronization could cause a soft lockup at that location.
>
> Other things besides vCPU preemption that could cause a soft lockup at
> that location include corrupted kernel text, corrupted kernel stack,
> and incessant interrupts.
>
> Other thoughts?
>
> Thanx, Paul
>

Sorry for the late reply.

I double-checked: the address-to-symbol translation is not failing, and the vCPU resources are not overloaded. Additionally, I ran multiple rounds of reproduction with Syzkaller and obtained two types of reproducers: a C reproducer and a syscall sequence. I'm not sure whether there are any other issues; that's all I can offer for now.

I'm not sure whether this information is useful to you. If this turns out not to be a real bug, please ignore it.

C reproducer: https://drive.google.com/file/d/1niejFamwXcRumUsn1Ur8xiX2jfZAcown/view?usp=sharing
Syscall sequence reproducer: https://drive.google.com/file/d/1gBfe_WZZeHfrhTlXp5zJfV7be21iGCAC/view?usp=sharing
New log info: https://drive.google.com/file/d/1x7eugPh2RUUF9lOf3s9K64pARkkUE1Qn/view?usp=sharing

----
Thanks,
Kun Hu