Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

From: Frederic Weisbecker
Date: Mon Aug 27 2018 - 22:25:53 EST


On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
> On 24.08.2018 16:30, Frederic Weisbecker wrote:
> >> Can you try the one I posted in this thread:
> >>
> >> https://lkml.kernel.org/r/alpine.DEB.2.21.1808240851420.1668@xxxxxxxxxxxxxxxxxxxxxxx
> >>
> >> Also below for reference.
> >>
> >> Thanks,
> >>
> >> tglx
> >>
> >> 8<----------------
> >> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> >> index 5b33e2f5c0ed..6aab9d54a331 100644
> >> --- a/kernel/time/tick-sched.c
> >> +++ b/kernel/time/tick-sched.c
> >> @@ -888,7 +888,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
> >>  	if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
> >>  		static int ratelimit;
> >>
> >> -		if (ratelimit < 10 &&
> >> +		if (ratelimit < 10 && !in_softirq() &&
> >>  		    (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
> >>  			pr_warn("NOHZ: local_softirq_pending %02x\n",
> >>  				(unsigned int) local_softirq_pending());
> >
> > I fear it may not work in his case because it happens in -next and we don't stop
> > the idle tick from IRQ tail anymore. So we shouldn't be interrupting a softirq
> > in this path. Still it's worth trying, I may well be missing something.
> >
> > Thanks.
> >
> I tested it and Frederic is right, it doesn't help. Can it be somehow related to
> the cpu being brought down during suspend? Because I get the warning only during
> suspend when the cpu is inactive already (but still online).

It's hard to tell. I haven't been able to reproduce it on suspend to disk/mem.

Does this script eventually trigger it after some time?

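#!/bin/bash

# Write $1 (0 = offline, 1 = online) to every CPU from 1 to $2.
# CPU 0 is left alone.
do_hotplug()
{
	for i in $(seq 1 "$2")
	do
		echo "$1" > "/sys/devices/system/cpu/cpu$i/online"
	done
}

LAST_CPU=$(($(nproc)-1))

# Offline, then online, all secondary CPUs in an endless loop.
while true
do
	do_hotplug 0 "$LAST_CPU"
	do_hotplug 1 "$LAST_CPU"
done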
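
To see whether the hotplug loop has triggered the warning, something like the
following could be run from a second shell. This is only a rough sketch:
count_nohz_warnings is a hypothetical helper name, and it assumes the message
text matches the pr_warn() format quoted above.

```shell
#!/bin/bash

# Hypothetical helper: read a kernel log on stdin and print how many
# lines contain the NOHZ pending-softirq warning.
count_nohz_warnings()
{
	# grep -c prints the match count; it exits non-zero when the count
	# is 0, so "|| true" keeps the helper usable under "set -e".
	grep -c 'NOHZ: local_softirq_pending' || true
}

# Example use (not executed here):
#   dmesg | count_nohz_warnings
```

A non-zero count while the loop is running would suggest that plain CPU
hotplug is enough to hit the problem, without going through suspend.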