Re: [PATCH tick-sched] Clarify "NOHZ: local_softirq_pending" warning

From: Andy Lutomirski
Date: Sat Jun 27 2020 - 18:14:20 EST



> On Jun 27, 2020, at 2:46 PM, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Sat, Jun 27, 2020 at 02:02:15PM -0700, Andy Lutomirski wrote:
>>> On Fri, Jun 26, 2020 at 2:05 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>>
>>> Currently, can_stop_idle_tick() prints "NOHZ: local_softirq_pending HH"
>>> (where "HH" is the hexadecimal softirq vector number) when one or more
>>> non-RCU softirq handlers are still enabled when checking to stop the
>>> scheduler-tick interrupt. This message is not as enlightening as one
>>> might hope, so this commit changes it to "NOHZ tick-stop error: Non-RCU
>>> local softirq work is pending, handler #HH".
>>
>> Thank you! It would be even better if it would explain *why* the
>> problem happened, but I suppose this code doesn't actually know.
>
> Glad to help!
>
> To your point, is it possible to bisect the appearance of this message,
> or is it as usual non-reproducible? (Hey, had to ask!)
>
>

In this particular case, I tracked it down by good old fashioned sleuthing for bugs, but it's still unclear to me precisely how NOHZ gets involved. The bug is that we were entering the kernel from usermode, doing nmi_enter(), turning on interrupts, maybe getting a page fault, raising a signal, turning off interrupts, nmi_exit(), and back to usermode, with the signal still queued and undelivered. This is all kinds of bad, but I still don't understand what softirqs or idle have to do with it.

But I have the bug fixed now!
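
For anyone trying to read the "HH" value in the warning quoted above, here is a rough userspace sketch of how it could be decoded. It assumes that HH is the pending-softirq bitmask (one bit per softirq, as returned by local_softirq_pending()) rather than a single vector index, and that the bit order follows the softirq enum in include/linux/interrupt.h. decode_softirq_mask() is a made-up helper for illustration, not a kernel function.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Softirq names, assumed to be in the same order as the kernel's
 * softirq enum in include/linux/interrupt.h: bit N of the pending
 * mask corresponds to softirq_names[N].
 */
static const char *const softirq_names[] = {
	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
	"IRQ_POLL", "TASKLET", "SCHED", "HRTIMER", "RCU",
};

#define NR_SOFTIRQ_NAMES (sizeof(softirq_names) / sizeof(softirq_names[0]))

/*
 * Hypothetical helper (not part of the kernel): write the names of
 * the softirqs whose bits are set in @mask into @buf, separated by
 * single spaces.  Returns the number of names written.
 */
static int decode_softirq_mask(unsigned int mask, char *buf, size_t len)
{
	size_t used = 0;
	int count = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (size_t i = 0; i < NR_SOFTIRQ_NAMES; i++) {
		int n;

		if (!(mask & (1u << i)))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     count ? " " : "", softirq_names[i]);
		if (n < 0 || (size_t)n >= len - used) {
			/* Out of space: drop the partial name and stop. */
			buf[used] = '\0';
			break;
		}
		used += (size_t)n;
		count++;
	}
	return count;
}
```

Under those assumptions, a reported value of 08 decodes to "NET_RX", and 202 decodes to "TIMER RCU".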