Re: Unknown NMI after S4 resume

From: Pavel Machek
Date: Thu Jul 25 2024 - 04:27:21 EST


Hi!

> When running an S4 test on a Zhaoxin platform with Ubuntu 22.04 and kernel
> 6.10, we got unknown NMI messages after S4 resume:
>
> [ 115.792224] Uhhuh. NMI received for unknown reason 2d on CPU 0.
> [ 115.792226] Do you have a strange power saving mode enabled?
> [ 115.792228] Dazed and confused, but trying to continue
>
> The problem also reproduces on an Intel platform.
>
> After tracing, we found that the unknown NMI occurs as follows:
> a, The 1st kernel starts normally and the NMI watchdog is enabled on all
> cores;
> b, The NMI watchdog is disabled on all cores through the sys interface, so
> the variable active_events drops to zero;
> c, Hibernation starts: the hibernation image is created and saved, then the
> system enters hibernation;
> d, An S4 resume event happens; the 2nd kernel starts normally and the NMI
> watchdog is enabled on all cores;
> e, The 2nd kernel finds the S4 image and tries to restore it;
> f, The 2nd kernel disables the non-boot CPUs, which disables the NMI
> watchdog on the APs;
> g, The S4 image saved in step c is restored;
> h, The hibernated 1st kernel is restored and re-enables the non-boot CPUs.
> Because the NMI watchdog was disabled in step b, the APs' NMI watchdog stays
> disabled. In addition, the variable active_events is restored to zero.
>
> But the BSP's NMI watchdog is still enabled, so the PMC keeps triggering
> NMIs periodically.
> When a PMC NMI fires, perf_event_nmi_handler is called, but it sees that
> active_events is zero, so it returns NMI_DONE immediately.
> This then leads to the unknown NMI messages.
>
> static int
> perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
> {
> 	u64 start_clock;
> 	u64 finish_clock;
> 	int ret;
>
> 	/*
> 	 * All PMUs/events that share this PMI handler should make sure to
> 	 * increment active_events for their events.
> 	 */
> 	if (!atomic_read(&active_events))
> 		return NMI_DONE;
> ......
>
> It seems that on S4 resume the BSP is not reconfigured according to the
> NMI watchdog setting that was made through the sys interface and saved in
> the S4 image.
> Should this situation be handled with a patch?

Yes, please.

The watchdog driver should get suspend/resume hooks, and probably do the
same init on boot and on resume.
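
To make the sequence and the suggested fix concrete, here is a rough
userspace model in C. It is only a sketch, not kernel code: struct
kernel_image, struct hw_state, watchdog_set_cpu(), simulate_s4() and the
NO_NMI value are all made up for illustration, and the "resume hook" is an
assumption about what such a hook might do, not an existing interface. The
model splits state into what is saved in the S4 image (active_events, the
watchdog setting) and what lives only in the CPU (whether the PMC is armed).

```c
#include <stdbool.h>

#define NR_CPUS 4
#define NO_NMI  (-1)                    /* hypothetical: no PMC NMI fires at all */
enum { NMI_DONE = 0, NMI_HANDLED = 1 }; /* stand-ins for the kernel constants */

/* State kept in kernel memory, so it is saved in and restored from the image. */
struct kernel_image {
	int  active_events;     /* models the perf active_events counter */
	bool watchdog_enabled;  /* models the setting made via the sys interface */
};

/* State kept in the CPUs; it is not part of the hibernation image. */
struct hw_state {
	bool pmc_armed[NR_CPUS];
};

/* Arm or disarm the watchdog PMC on one CPU and keep active_events in sync. */
static void watchdog_set_cpu(struct kernel_image *k, struct hw_state *hw,
			     int cpu, bool on)
{
	if (hw->pmc_armed[cpu] == on)
		return;
	hw->pmc_armed[cpu] = on;
	k->active_events += on ? 1 : -1;
}

static void watchdog_set_all(struct kernel_image *k, struct hw_state *hw, bool on)
{
	k->watchdog_enabled = on;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		watchdog_set_cpu(k, hw, cpu, on);
}

/* Models only the early-exit check quoted from perf_event_nmi_handler(). */
static int model_nmi_handler(const struct kernel_image *k)
{
	if (!k->active_events)
		return NMI_DONE;        /* nobody claims it: "unknown NMI" */
	return NMI_HANDLED;
}

/* Walk through steps a..h and report what a PMC NMI on CPU 0 would lead to. */
static int simulate_s4(bool with_resume_hook)
{
	struct hw_state hw = {0};
	struct kernel_image first = {0}, second = {0}, saved;

	watchdog_set_all(&first, &hw, true);            /* a */
	watchdog_set_all(&first, &hw, false);           /* b */
	saved = first;                                  /* c: image written */
	hw = (struct hw_state){0};                      /* power off, CPU state lost */

	watchdog_set_all(&second, &hw, true);           /* d */
	for (int cpu = 1; cpu < NR_CPUS; cpu++)         /* e, f: APs go offline */
		watchdog_set_cpu(&second, &hw, cpu, false);

	first = saved;                                  /* g: image restored */
	for (int cpu = 1; cpu < NR_CPUS; cpu++)         /* h: APs come back */
		if (first.watchdog_enabled)
			watchdog_set_cpu(&first, &hw, cpu, true);

	if (with_resume_hook) {
		/* Assumed hook: redo the boot-time init on the BSP. */
		hw.pmc_armed[0] = false;
		if (first.watchdog_enabled)
			watchdog_set_cpu(&first, &hw, 0, true);
	}

	if (!hw.pmc_armed[0])
		return NO_NMI;
	return model_nmi_handler(&first);
}
```

In this model simulate_s4(false) ends with the PMC on CPU 0 still armed by
the 2nd kernel while the restored active_events is zero, so the handler
returns NMI_DONE. With the hypothetical hook, the BSP is brought back in
line with the restored setting and no stray NMI fires.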

Best regards,

Pavel
--
People of Russia, stop Putin before his war on Ukraine escalates.
