Re: Linux 4.15-rc2: Regression in resume from ACPI S3

From: Rafael J. Wysocki
Date: Wed Dec 06 2017 - 09:05:26 EST


On Wednesday, December 6, 2017 1:23:34 PM CET Thomas Gleixner wrote:
> On Wed, 6 Dec 2017, Michal Hocko wrote:
> > merging tip/x86/urgent on top of your tree fixed this problem for me,
> > but I am seeing something else
> > [ 131.711412] ACPI: Preparing to enter system sleep state S3
> > [ 131.755328] ACPI: EC: event blocked
> > [ 131.755328] ACPI: EC: EC stopped
> > [ 131.755328] PM: Saving platform NVS memory
> > [ 131.755344] Disabling non-boot CPUs ...
> > [ 131.779330] IRQ 124: no longer affine to CPU1
> > [ 131.780334] smpboot: CPU 1 is now offline
> > [ 131.804465] smpboot: CPU 2 is now offline
> > [ 131.827291] IRQ 122: no longer affine to CPU3
> > [ 131.827292] IRQ 123: no longer affine to CPU3
> > [ 131.828293] smpboot: CPU 3 is now offline
> > [ 131.830991] ACPI: Low-level resume complete
> > [ 131.831092] ACPI: EC: EC started
> > [ 131.831093] PM: Restoring platform NVS memory
> > [ 131.831864] do_IRQ: 0.55 No irq handler for vector
>
> Hmm, that's really odd.
>
> > [ 131.831884] Enabling non-boot CPUs ...
> > [ 131.831909] x86: Booting SMP configuration:
> > [ 131.831910] smpboot: Booting Node 0 Processor 1 APIC 0x2
> > [ 131.832913] cache: parent cpu1 should not be sleeping
>
> This is an old one.
>
> > [ 131.833058] CPU1 is up
> > [ 131.833067] smpboot: Booting Node 0 Processor 2 APIC 0x1
> > [ 131.833864] cache: parent cpu2 should not be sleeping
> > [ 131.833983] CPU2 is up
> > [ 131.833995] smpboot: Booting Node 0 Processor 3 APIC 0x3
> > [ 131.834776] cache: parent cpu3 should not be sleeping
> > [ 131.834923] CPU3 is up
> >
> > "No irq handler" part looks a bit scary (maybe related to lost affinity
> > messages?) but the following messages look quite as well. Is this
> > something known? The system seems to be up and running without any
> > visible issues.
>
> I assume it's due to the affinity break, just that we don't know right now
> on which CPU that do_IRQ() message triggered. I assume it's CPU0 because
> the others are offline already, but ....

This is resume from S3, so the firmware might do something odd to the other
CPUs, but in case it didn't (which is quite likely or we would have seen more
of these messages), they are offline and in mwait_play_dead(), so IMO it is
safe to assume that this was CPU0.
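
FWIW, the "everything except CPU0 was offline at that point" argument can be
checked mechanically against the quoted log. Below is a rough sketch, not
anything from the kernel tree: the function name online_cpus_at_stray_irq() and
the assumption that the machine has exactly CPUs 0-3 are mine, inferred from the
log excerpt above. It only replays the offline/online messages and reports
which CPUs were still online when the do_IRQ message was printed.

```python
import re

# Excerpt of the dmesg output quoted above (only the relevant lines).
LOG = """\
[ 131.755344] Disabling non-boot CPUs ...
[ 131.780334] smpboot: CPU 1 is now offline
[ 131.804465] smpboot: CPU 2 is now offline
[ 131.828293] smpboot: CPU 3 is now offline
[ 131.830991] ACPI: Low-level resume complete
[ 131.831864] do_IRQ: 0.55 No irq handler for vector
[ 131.831884] Enabling non-boot CPUs ...
[ 131.833058] CPU1 is up
"""


def online_cpus_at_stray_irq(log, all_cpus):
    """Return the sorted list of CPUs still online when the first
    "No irq handler for vector" line appears, or None if there is none.

    all_cpus is an assumption supplied by the caller (here: CPUs 0-3).
    """
    online = set(all_cpus)
    for line in log.splitlines():
        m = re.search(r"smpboot: CPU (\d+) is now offline", line)
        if m:
            online.discard(int(m.group(1)))
            continue
        m = re.search(r"\bCPU(\d+) is up", line)
        if m:
            online.add(int(m.group(1)))
            continue
        if "No irq handler for vector" in line:
            return sorted(online)
    return None


print(online_cpus_at_stray_irq(LOG, range(4)))
```

With the log above this leaves only CPU 0 in the set, which matches the
assumption. It obviously cannot tell whether the firmware touched the other
CPUs during resume, so it only covers the "in case it didn't" branch.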

And this appears to have happened at arch_suspend_enable_irqs() time,
which is just local_irq_enable() on x86, running on CPU0.

> I'll think about it how we can figure out what's going on.

It looks like an interrupt that triggered right after we enabled
interrupts on the boot CPU.

Thanks,
Rafael