Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

From: Frederic Weisbecker
Date: Thu Dec 27 2018 - 01:53:43 EST


On Mon, Oct 15, 2018 at 10:58:54PM +0200, Heiner Kallweit wrote:
> On 28.09.2018 15:18, Frederic Weisbecker wrote:
> > On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
> >> On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
> >>> On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
> >>>> I tested it and Frederic is right, it doesn't help. Can it be somehow related to
> >>>> the cpu being brought down during suspend? Because I get the warning only during
> >>>> suspend when the cpu is inactive already (but still online).
> >>>
> >>> It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.
> >>>
> >>> Does this script eventually trigger it after some time?
> >>
> >> Any update to this?
> >
> > Heiner? Can you please test the script I sent to you?
> >
> > Thanks.
> >
> Sorry, took some time .. And yes, running your script triggers the message too.
>
> [ 25.646015] x86: Booting SMP configuration:
> [ 25.646044] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 25.664491] smpboot: CPU 1 is now offline
> [ 25.679299] x86: Booting SMP configuration:
> [ 25.679329] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 25.698449] smpboot: CPU 1 is now offline
> [ 25.711698] x86: Booting SMP configuration:
> [ 25.711727] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 25.729185] NOHZ: local_softirq_pending 202
> [ 25.729229] NOHZ: local_softirq_pending 202
> [ 25.730759] smpboot: CPU 1 is now offline
> [ 25.744053] x86: Booting SMP configuration:
> [ 25.744083] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 25.762520] smpboot: CPU 1 is now offline
> [ 25.776834] x86: Booting SMP configuration:
> [ 25.776863] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 25.794189] NOHZ: local_softirq_pending 202
> [ 25.796672] smpboot: CPU 1 is now offline
> [ 25.805970] x86: Booting SMP configuration:
> [ 25.805999] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 25.827360] smpboot: CPU 1 is now offline
> [ 25.839043] x86: Booting SMP configuration:
> [ 25.839073] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 25.858184] NOHZ: local_softirq_pending 202
> [ 25.862182] smpboot: CPU 1 is now offline
> [ 25.873759] x86: Booting SMP configuration:
> [ 25.873789] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 25.893385] smpboot: CPU 1 is now offline
>

Sorry, I got sidetracked and almost forgot about it.

So this is triggered by CPU hotplug. At some point the CPU has an
opportunity to go idle and for some reason the timer softirq is still
pending. We need to know which timer this is about and why the timer
softirq keeps pending.

I'm going to need your help again. Can you please run the following (possibly
repeat until it triggers the bug) ?

echo 1 > /sys/devices/system/cpu/cpu1/online

# Pause and reset tracing
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo > /sys/kernel/debug/tracing/trace

# Turn on relevant events
echo 1 > /sys/kernel/debug/tracing/events/timer/timer_*/enable
echo 1 > /sys/kernel/debug/tracing/events/irq/softirq_*/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on

# Trigger
echo 0 > /sys/devices/system/cpu/cpu1/online

echo 0 > /sys/kernel/debug/tracing/tracing_on

And please apply the following patch before. With all that I'll have the
relevant informations stored in /sys/kernel/debug/tracing/per_cpu/cpu1/trace
Please send its content to me. Thanks!

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 69e673b..0e57a3b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -892,6 +892,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
(local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
pr_warn("NOHZ: local_softirq_pending %02x\n",
(unsigned int) local_softirq_pending());
+ trace_dump_stack(0);
ratelimit++;
}
return false;