Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

From: Heiner Kallweit
Date: Fri Dec 28 2018 - 01:39:46 EST


On 28.12.2018 07:34, Heiner Kallweit wrote:
> On 28.12.2018 02:31, Frederic Weisbecker wrote:
>> On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:
>>>
> [...]
>>
>> Interesting, the softirq is raised from hardirq but it's not handled at the end of
>> the IRQ. Are you running threaded IRQs by any chance? If so, I would expect ksoftirqd
>> to handle the pending work before we go idle. However I can imagine a small window
>> where such an expectation may not be met: if the softirq is raised after the ksoftirqd
>> thread is parked (CPUHP_AP_SMPBOOT_THREADS), which is right before we disable the CPU
>> (CPUHP_TEARDOWN_CPU).
>>
> I have a network driver (r8169) using NAPI which runs in softirq context AFAIK.
> For testing purposes I sometimes trigger system suspend via network, so there is
> network adapter activity when the system suspends. Apart from that, nothing really
> exciting:
>            CPU0       CPU1       CPU2       CPU3
>   0:         43          0          0          0   IO-APIC  2-edge        timer
>   1:          4          0          0          0   IO-APIC  1-edge        i8042
>   8:          0          1          0          0   IO-APIC  8-fasteoi     rtc0
>   9:          0          0          0          0   IO-APIC  9-fasteoi     acpi
>  12:          0          0          0          5   IO-APIC  12-edge       i8042
> 120:          0          0          0          0   PCI-MSI  311296-edge   PCIe PME
> 121:          0          0          0          0   PCI-MSI  315392-edge   PCIe PME
> 122:          0          0          0          0   PCI-MSI  327680-edge   PCIe PME
> 123:          0          0       3328          0   PCI-MSI  294912-edge   ahci[0000:00:12.0]
> 124:          0        133          0          0   PCI-MSI  344064-edge   xhci_hcd
> 125:          0          0         32          0   PCI-MSI  245760-edge   mei_me
> 127:        381          0          0          0   PCI-MSI  1572864-edge  enp3s0
> 128:          0          0          0        236   PCI-MSI  32768-edge    i915
> 129:          0        374          0          0   PCI-MSI  229376-edge   snd_hda_intel:card0
>
>> I don't know if we can afford to ignore a softirq even at this late stage. We should
>> probably avoid leaking any. So here is a possible fix, if you don't mind trying:
>>
> I tested your patch and at least in the first minutes of testing couldn't reproduce
> the issue any longer. I tested manual system suspend and the following script you
> sent when we started to analyze the issue.
>

After some more testing the issue still hasn't recurred, so it seems both your
analysis and your approach to fixing it were right. Thanks!
I'll let you know if the issue pops up again under special circumstances.


> Heiner
>
> --------------------------------------------------------------------------
>
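> #!/bin/bash
>
> # Write state $1 (0 = offline, 1 = online) to cpu1..cpu$2.
> do_hotplug()
> {
>     for i in $(seq 1 $2)
>     do
>         echo $1 > /sys/devices/system/cpu/cpu$i/online
>     done
> }
>
> # Highest CPU number, assuming CPUs are numbered 0..nproc-1.
> LAST_CPU=$(($(nproc)-1))
>
> # Offline and re-online all secondary CPUs in an endless loop.
> while true
> do
>     do_hotplug 0 $LAST_CPU
>     do_hotplug 1 $LAST_CPU
> done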
>
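
For anyone who wants to rerun the hotplug test with a fixed number of rounds
instead of an endless loop, below is a rough sketch of the same idea. The names
CPU_SYSFS_ROOT, set_cpus_online and hotplug_rounds are made up for illustration
and are not part of the original script; the only interface assumed is the
per-CPU "online" file already used above. CPUs whose "online" file is missing
or not writable are skipped silently.

```shell
#!/bin/bash
# Sketch only: bounded variant of the hotplug stress loop quoted above.
# CPU_SYSFS_ROOT, set_cpus_online and hotplug_rounds are hypothetical names.

CPU_SYSFS_ROOT=${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}

# Write state (0 = offline, 1 = online) to cpu1..cpu<last>.
# CPUs without a writable "online" file are skipped.
set_cpus_online()
{
    local state=$1 last=$2 i ctl
    for i in $(seq 1 "$last")
    do
        ctl="$CPU_SYSFS_ROOT/cpu$i/online"
        [ -w "$ctl" ] || continue
        echo "$state" > "$ctl"
    done
}

# Take cpu1..cpu<last> offline and back online, <rounds> times.
hotplug_rounds()
{
    local rounds=$1 last=$2 r
    for r in $(seq 1 "$rounds")
    do
        set_cpus_online 0 "$last"
        set_cpus_online 1 "$last"
    done
}

# Usage (as root), e.g. 100 rounds over all secondary CPUs:
#   hotplug_rounds 100 $(($(nproc) - 1))
```

Making the sysfs root overridable is only there so the loop can be dry-run
against a scratch directory before pointing it at the real system.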