Re: Kernel Freeze with American Megatrends BIOS

From: Roland Singer
Date: Wed Aug 31 2016 - 16:07:53 EST

Here is Peter Wu's reply, which was not sent to the mailing list because
I had to resend my e-mail to him due to a failure...

-------- Forwarded Message --------
Subject: Re: Fwd: Re: Kernel Freeze with American Megatrends BIOS
Date: Wed, 31 Aug 2016 18:08:53 +0200
From: Peter Wu <peter@xxxxxxxxxxxxx>
To: Roland Singer <roland.singer@xxxxxxxxxxxxx>

On Wed, Aug 31, 2016 at 05:56:18PM +0200, Roland Singer wrote:

> > If you look at my notes.txt, you will see that _OFF always executes the
> > same code. PGON differs. When the problem occurs, "Q0L0" somehow always
> > reads back as non-zero and LNKS < 7.
> >
> Oh you're Lekensteyn ^^

Yes, that's me :) I wrote bbswitch and did the Optimus and PR3 ACPI support
in nouveau, so I am fairly certain about what happens behind the scenes.

> I don't have LNKS and no while loop after calling LKEN ?!

Yes, that is what I said in:

"Other affected devices have similar code, differences are small:
No check for LNKS (avoids the infinite loop, but device is still off)"

> >>
> >> I noticed following:
> >>
> >> 1. Blacklist nouveau
> >> 2. Boot to GDM login manager (Wayland)
> >> 3. Switch to TTY with CTRL+ALT+FN2
> >> 4. Load bbswitch
> >> 5. Switch off GPU
> >> 6. run lspci -> no freeze
> >> 7. Switch to GDM
> >> 8. Login to a Wayland session (X11 won't work)
> >> 9. run lspci in a GUI terminal -> system freezes
> >
> > Is nouveau somehow loaded anyway? All those extra components (X11,
> > Wayland, etc.) are unnecessary to reproduce the core problem. It occurs
> > whenever the device is being resumed (either via DSM/_PS0 or via power
> > resource PG00._ON).
> >
> Sorry, that was nonsense. The steps to reproduce the problem are still valid.
> I didn't wait long enough for it to power down...
> But what's interesting:
> 1. Blacklist nouveau
> 2. Load bbswitch
> 3. Power off GPU with bbswitch
> 4. Power on GPU with bbswitch
> 5. Run lspci
> 6. Power off GPU with bbswitch
> 7. Run lspci -> freeze
> So setting the GPU power state with bbswitch works as expected.
> Powering it on is also fine. I did this a couple of times.
> But powering it off and letting lspci power it on ends in a race.
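
The seven steps quoted above can be sketched as a small script. This is
only an illustration, assuming bbswitch is loaded, nouveau is blacklisted
and the script runs as root. The BBSWITCH override and the function names
gpu_power, gpu_state and repro are made up for this sketch. On an affected
machine the last lspci call is expected to hang the system, so only call
repro on a box you can hard-reset.

```shell
#!/bin/sh
# Sketch of the bbswitch OFF/ON/lspci/OFF/lspci sequence.
# BBSWITCH is a hypothetical override so the helpers can be dry-run
# against a plain file; the real interface is /proc/acpi/bbswitch.
BBSWITCH="${BBSWITCH:-/proc/acpi/bbswitch}"

# gpu_power ON|OFF: ask bbswitch to change the GPU power state.
gpu_power() {
    case "$1" in
        ON|OFF) ;;
        *) echo "usage: gpu_power ON|OFF" >&2; return 2 ;;
    esac
    printf '%s\n' "$1" > "$BBSWITCH"
}

# gpu_state: bbswitch reports "<pci address> <state>"; print the state.
gpu_state() {
    awk '{ print $2 }' "$BBSWITCH"
}

# repro: steps 3 to 7 of the quoted list (steps 1 and 2 are setup).
repro() {
    gpu_power OFF && gpu_state      # step 3
    gpu_power ON && gpu_state       # step 4
    lspci > /dev/null               # step 5: no freeze reported
    gpu_power OFF && gpu_state      # step 6
    lspci > /dev/null               # step 7: freeze reported here
}
```

Sourcing the file only defines the functions; nothing is written to
bbswitch until gpu_power or repro is called explicitly.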

In some cases I also found that it does not always happen at the first try,
but with nouveau it always seems to happen.

> It might be, that lspci does not only power the GPU on, but triggers
> another pci action which causes the race condition.
> Does this have something to do with your quote about the retrain bit?

That is an interesting hypothesis. Even if you invoke `lspci -s01:00.0`
for example, it will always probe for all devices. So maybe interaction
with its parent device (PCI root port 00:02.0) causes issues.

However I also tested without lspci before, and the problem still
exists. You can trigger runtime resume via (as root):

echo on > /sys/bus/pci/devices/0000:01:00.0/power/control

Set it to "auto" to make it sleep again.
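
For repeated testing, the two writes can be wrapped in small helpers. This
is a minimal sketch, assuming the standard runtime PM files under
/sys/bus/pci/devices/<address>/power/ and root privileges. The SYSFS_PCI
override and the function names set_rpm, get_rpm and rpm_status are
invented for illustration.

```shell
#!/bin/sh
# Sketch: control runtime PM of one PCI device through sysfs.
# SYSFS_PCI is a hypothetical override for dry runs on a fake tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# set_rpm DEVICE on|auto: "on" forces the device awake (runtime resume),
# "auto" allows it to runtime-suspend again.
set_rpm() {
    case "$2" in
        on|auto) ;;
        *) echo "usage: set_rpm DEVICE on|auto" >&2; return 2 ;;
    esac
    printf '%s\n' "$2" > "$SYSFS_PCI/$1/power/control"
}

# get_rpm DEVICE: print the current control setting.
get_rpm() {
    cat "$SYSFS_PCI/$1/power/control"
}

# rpm_status DEVICE: print the runtime PM status (e.g. active, suspended).
rpm_status() {
    cat "$SYSFS_PCI/$1/power/runtime_status"
}
```

For example, "set_rpm 0000:01:00.0 on" followed by "rpm_status
0000:01:00.0" shows whether the device actually came back, without lspci
touching any other device.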

Kind regards,
Peter Wu