Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

From: Andy Lutomirski
Date: Fri May 31 2019 - 12:55:06 EST


On Fri, May 31, 2019 at 9:19 AM Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>
> On Fri, May 31, 2019 at 05:41:18PM +0200, Jiri Kosina wrote:
> > On Fri, 31 May 2019, Josh Poimboeuf wrote:
> >
> > > The only question I'd have is if we have data on the power savings
> > > difference between hlt and mwait. mwait seems to wake up on a lot of
> > > different conditions which might negate its deeper sleep state.
> >
> > hlt wakes up on basically the same set of events, but has the
> > auto-restarting semantics on some of them (especially SMM). So the wakeup
> > frequency itself shouldn't really contribute to power consumption
> > difference; it's the C-state that mwait allows CPU to enter.
>
> Ok. I reluctantly surrender :-) For your v4:
>
> Reviewed-by: Josh Poimboeuf <jpoimboe@xxxxxxxxxx>
>
> It works as a short term fix, but it's fragile, and it does feel like
> we're just adding more duct tape, as Andy said.
>

Just to clarify what I was thinking, it seems like soft-offlining a
CPU and resuming a kernel have fundamentally different requirements.
To soft-offline a CPU, we want to get power consumption as low as
possible and make sure that MCE won't kill the system. It's okay for
the CPU to occasionally execute some code. For resume, what we're
really doing is trying to hand control of all CPUs from kernel A to
kernel B. There are two basic ways to hand off control of a given
CPU: we can jump (with JMP, RET, horrible self-modifying code, etc.)
from one kernel to the other, or we can attempt to make a given CPU
stop executing code from either kernel at all and then forcibly wrench
control of it in kernel B. Either approach seems okay, but the latter
approach depends on getting the CPU to reliably stop executing code.
We don't care about power consumption for resume, and I'm not even
convinced that we need to be able to survive an MCE that happens while
we're resuming, although surviving MCE would be nice.

So if we don't want to depend on nasty system details at all, we could
have the first kernel explicitly wake up all CPUs and hand them all
off to the new kernel, more or less the same way that we hand over
control of the BSP right now. Or we can look for a way to tell all
the APs to stop executing kernel code, and the only architectural way
I know of to do that is to send an INIT IPI (and then presumably
deassert INIT -- the SDM is a bit vague).
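
To make the INIT option concrete, here is a rough sketch of the ICR
values involved, assuming the xAPIC register layout. The bit values
match the kernel's APIC_DM_INIT, APIC_INT_LEVELTRIG, APIC_INT_ASSERT and
APIC_DEST_ALLBUT definitions, but the macro and helper names below are
made up for illustration, and whether the deassert step is actually
needed on a given CPU is exactly the part I find vague.

```c
#include <assert.h>
#include <stdint.h>

/* ICR low-dword fields (xAPIC layout). Names are illustrative only. */
#define ICR_DM_INIT        0x00000500u  /* delivery mode 101b = INIT */
#define ICR_LEVEL_ASSERT   0x00004000u  /* level: 1 = assert, 0 = deassert */
#define ICR_TRIGGER_LEVEL  0x00008000u  /* trigger mode: 1 = level */
#define ICR_DEST_ALLBUT    0x000C0000u  /* shorthand 11b = all excluding self */

/* Hypothetical helper: ICR low value that asserts INIT. */
static uint32_t icr_init_assert(uint32_t shorthand)
{
    return ICR_DM_INIT | ICR_TRIGGER_LEVEL | ICR_LEVEL_ASSERT | shorthand;
}

/* Hypothetical helper: same message with the level bit cleared. */
static uint32_t icr_init_deassert(uint32_t shorthand)
{
    return ICR_DM_INIT | ICR_TRIGGER_LEVEL | shorthand;
}

/* Hypothetical helper: ICR high value for one target in xAPIC
 * physical mode, where the destination is an 8-bit APIC ID. */
static uint32_t icr_high_for(uint32_t apic_id)
{
    assert(apic_id <= 0xff);
    return apic_id << 24;
}
```

Writing icr_init_assert(ICR_DEST_ALLBUT) to the ICR would hit every AP
at once; passing 0 as the shorthand and using icr_high_for() targets a
single CPU instead.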

Or we could allocate a page, stick a GDT, a TSS, and a "1: hlt; jmp 1b"
loop in it, turn off paging, and run that code. And then somehow convince
the kernel we load not to touch that page until it finishes waking up
all CPUs. This seems conceptually simple and very robust, but I'm not
sure it fits in with the way hibernation works right now at all.
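
For what it's worth, the loop itself is tiny. A sketch of dropping it
into such a page, assuming a hypothetical emit_park_loop() helper and a
4096-byte page (neither exists in the kernel as far as I know); the GDT
and TSS would have to live elsewhere in the same page and are not shown:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PARK_PAGE_SIZE 4096u

/* Hypothetical helper: write the encoding of "1: hlt; jmp 1b" at
 * page[off]. Returns the offset just past the loop, or 0 if the
 * three bytes would not fit in the page. */
static size_t emit_park_loop(uint8_t *page, size_t off)
{
    assert(page != NULL);
    if (off > PARK_PAGE_SIZE - 3)
        return 0;
    page[off + 0] = 0xF4;   /* hlt */
    page[off + 1] = 0xEB;   /* jmp rel8 */
    page[off + 2] = 0xFD;   /* rel8 = -3: back to the hlt */
    return off + 3;
}
```

The jump is relative, so the same three bytes work wherever the loop
ends up in the page.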