Kernel 5.3.x, 5.2.2+: VMware player suspend on 64/32 bit guests

From: Woody Suwalski
Date: Sat Aug 10 2019 - 07:24:39 EST


Moving the thread to LKML, as suggested by Thomas...

---------- Forwarded message ---------
From: Woody Suwalski <terraluna977@xxxxxxxxx>
Date: Thu, Aug 1, 2019 at 3:45 PM
Subject: Intermittent suspend on 5.3 / 5.2
To: Rafael J. Wysocki <rjw@xxxxxxxxxxxxx>


Hi Rafał,
I know that you are investigating some issues between these 2 kernels,
however I see a probably unrelated problem with suspend on 5.3 and
5.2.4. I think it has crept into 5.1.21 as well, but I am not sure (it is
intermittent). So far 4.20.17 works OK, and I think 5.2.0 works OK.
The problem I see is on both 32 and 64 bit VMs, in VMware Workstation
15. The VM tries to suspend when there is no activity. In the good case it
leaves a black box with a cursor in the top-left position. Upon wakeup from
VMware it goes to the VMware pre-BIOS screen, then expands the black box to
the run-size and switches to X.
The problem with new kernels is that (I think) the suspend fails - the
black box with cursor is there, but seems bigger, and of course is not
wake'able (have to reset). In kern.log suspend seems to be running OK, and
then new dmesg lines kick in, with no obvious culprit.
So I am looking for free advice:
a. You already know what it is
b. You may have suggestions as to which upstream patch could be to blame
c. I should boot with some debug params (console_off=0, or some other?)
and get some real info?

BTW. For suspend to work I had to override mem_sleep to [shallow], or
maybe later to [s2idle] (the actual VMs are at work, referring from
memory...)
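
For anyone checking which mem_sleep mode a guest is actually using, here is a minimal sketch. It assumes the usual sysfs format of /sys/power/mem_sleep, where all supported modes are listed on one line and the active one is in square brackets (for example "s2idle shallow [deep]"). The function names are hypothetical and not from this thread.

```python
def parse_mem_sleep(text):
    """Parse the content of /sys/power/mem_sleep.

    Returns (modes, active): the list of supported modes and the one
    shown in square brackets, or None if no mode is bracketed.
    """
    modes = []
    active = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]   # strip the brackets around the active mode
            active = token
        modes.append(token)
    return modes, active


def read_mem_sleep(path="/sys/power/mem_sleep"):
    """Read and parse the sysfs file (the path is the standard location)."""
    with open(path) as f:
        return parse_mem_sleep(f.read())
```

Calling read_mem_sleep() on the guest shows the current selection. Changing it is done by writing one of the listed mode names to the same file as root, or with the mem_sleep_default= boot parameter.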

If you have any ideas, all are welcome
Thanks, Woody



On 8/6/2019 3:18 PM, Woody Suwalski wrote:
Rafal, the patch (in 5.3-rc3)

Fixes: f850a48a0799 ("ACPI: PM: Allow transitions to D0 to occur in
special cases")

does not fix the issue - it must be something else...

Sorry for the late response.

There are known issues in 5.3-rc related to power management which should be fixed in -rc4. Please try that one when it is out.

Cheers!



Thomas Gleixner wrote:
Woody,

On Fri, 9 Aug 2019, Woody Suwalski wrote:

For future things like this, please CC LKML. There is nothing secret here
and CC'ing the mailing list allows other people to find this and spare
themselves the whole bisection pain. Aside from that, private mail does not
scale. On the list other people can look at it and give input eventually.

After bisecting I have found the potential culprit:
dfe0cf8b x86/ioapic: Implement irq_get_irqchip_state() callback

I am repeating the bisection from start to re-confirm.

Reverse-patch on 5.3-rc3 (64bit) is fixing the problem for me.
What is unclear - just adding the patch to 5.2.1 does not seem to
break it. So there is some more magic involved.
Of course it does not do anything because 5.2.1 does not have

f4999a2a3a48 ("genirq: Add optional hardware synchronization for shutdown")
Thomas, any suggestions?
What that means is that there is an interrupt shutdown which hits the
condition where an interrupt _IS_ marked in the IOAPIC as delivered to a
CPU, but not serviced yet.

Now the question is why it is not serviced. suspend_device_irqs() is
calling into synchronize_irq(), which is probably the place where
it hangs. But that's called with CPUs online and interrupts enabled.

To reproduce: use VMware Player 15, either 32 or 64 bit build.
Reboot and run "systemctl suspend". The first suspend works OK. The
second usually locks up on kernels 5.2.2 and up. Maybe try 4 times to
confirm good (it is intermittent).
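
The reproduction steps above could be scripted roughly as follows. This is only a sketch: the helper name and its parameters are hypothetical, and it assumes the guest is woken manually from the VMware UI between cycles. A hang shows up as the script never reaching the next cycle.

```python
import subprocess


def run_suspend_cycles(cycles=4, suspend=None, wait_for_resume=None):
    """Run `cycles` suspend attempts and return how many completed.

    suspend:         callable that triggers the suspend; defaults to
                     running "systemctl suspend".
    wait_for_resume: callable taking the cycle number, invoked after each
                     suspend request; defaults to waiting for Enter so the
                     VM can be woken by hand.
    """
    if suspend is None:
        suspend = lambda: subprocess.run(["systemctl", "suspend"], check=True)
    if wait_for_resume is None:
        wait_for_resume = lambda n: input(
            "cycle %d: wake the VM, then press Enter " % n)

    completed = 0
    for n in range(1, cycles + 1):
        suspend()
        wait_for_resume(n)
        completed += 1
    return completed
```

The default of 4 cycles mirrors the "try 4 times" suggestion above. Both callables can be replaced, for example to log a timestamp before each suspend.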
-ENOVMWAREPLAYER and I'm traveling so I don't have a machine handy to
install it. So if you can't debug it deeper down, I'm not going to have a
chance to look at it before the end of next week.

That said, can we please move this to LKML?

Thanks,

tglx


I can add some printk's into synchronize_irq(), however I have no idea if they will survive in the kmsg log after the next power reset. I can wait for a week :-)

Thanks, Woody