Re: NOHZ tick-stop error: Non-RCU local softirq work is pending

From: Paul E. McKenney
Date: Thu Dec 10 2020 - 19:48:24 EST


On Fri, Dec 11, 2020 at 01:15:15AM +0100, Frederic Weisbecker wrote:
> On Thu, Dec 10, 2020 at 01:17:56PM -0800, Paul E. McKenney wrote:
> > And please see attached. Lots of output, in fact, enough that it
> > was still dumping when the second instance happened.
>
> Thanks!
>
> So the issue is that ksoftirqd is parked on CPU down with vectors
> still pending. Either:
>
> 1) Ksoftirqd has exited because it has too many to process and it has
> exceeded the time limit, but then it parks, leaving the rest unhandled.
>
> 2) Ksoftirqd has completed its work but something has raised a softirq
> after it got parked.
>
> Can you run the following (on top of the previous patch and boot options)
> so that we can see whether it still triggers, and on what? If it does, we
> should be in case 2).

Thank you! I have started it up.

> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 09229ad82209..7d558cb7a037 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -650,7 +650,9 @@ static void run_ksoftirqd(unsigned int cpu)
>  		 * We can safely run softirq on inline stack, as we are not deep
>  		 * in the task stack here.
>  		 */
> -		__do_softirq();
> +		do {
> +			__do_softirq();
> +		} while (kthread_should_park() && local_softirq_pending());
>  		local_irq_enable();
>  		cond_resched();
>  		return;

Huh. I guess that self-propagating timers, RCU callbacks, and the
like are non-problems because they cannot retrigger while interrupts
are disabled? But can these things reappear just after the
local_irq_enable()?

In the case of RCU, softirq would need to run on this CPU, which it won't,
so we are good in that case. (Any stranded callbacks will be requeued
onto some other CPU later in the CPU-hotplug offline processing.)
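
For anyone following along, here is a rough user-space toy model of the
two cases above. It is only a sketch: pending, should_park, MAX_PER_PASS,
model_do_softirq(), run_once() and run_until_drained() are all made-up
names for illustration, not the kernel's, and the per-pass budget merely
stands in for the real time limit.

```c
#include <stdbool.h>

/* Toy model only: none of these are the kernel's real symbols. */
#define MAX_PER_PASS 2		/* hypothetical budget, stands in for the time limit */

static unsigned int pending;	/* bitmask of pending softirq vectors */
static bool should_park;	/* models kthread_should_park() */

/* Handle at most MAX_PER_PASS vectors, then give up as if out of time. */
static void model_do_softirq(void)
{
	int budget = MAX_PER_PASS;

	for (unsigned int bit = 0; bit < 32 && budget > 0; bit++) {
		if (pending & (1u << bit)) {
			pending &= ~(1u << bit);	/* "run" the handler */
			budget--;
		}
	}
}

/* Behavior before the patch: one pass, then return (and possibly park). */
static unsigned int run_once(unsigned int initial)
{
	pending = initial;
	model_do_softirq();
	return pending;		/* whatever is left is stranded: case 1) */
}

/* Behavior with the patch: loop while parking is requested and work remains. */
static unsigned int run_until_drained(unsigned int initial)
{
	pending = initial;
	do {
		model_do_softirq();
	} while (should_park && pending);
	return pending;
}
```

In this model, with should_park set, run_once(0x1f) leaves 0x1c pending
(case 1), whereas run_until_drained(0x1f) returns 0. Setting a bit in
pending after run_until_drained() returns models case 2), which a loop
that finishes before local_irq_enable() cannot catch.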

Thanx, Paul

> Thanks!