Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

From: Andy Lutomirski
Date: Fri May 31 2019 - 10:28:53 EST


On Fri, May 31, 2019 at 1:57 AM Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
>
> On Friday, May 31, 2019 10:47:21 AM CEST Jiri Kosina wrote:
> > On Fri, 31 May 2019, Josh Poimboeuf wrote:
> >
> > > > I disagree with that from the backwards compatibility point of view.
> > > >
> > > > I personally am quite frequently using different combinations of
> > > > resumer/resumee kernels, and I've never been bitten by it so far. I'd guess
> > > > I am not the only one.
> > > > Fixmap sort of breaks that invariant.
> > >
> > > Right now there is no backwards compatibility because nosmt resume is
> > > already broken.
> >
> > Yeah, well, but that's "only" for nosmt kernels at least.
> >
> > > For "future" backwards compatibility we could just define a hard-coded
> > > reserved fixmap page address, adjacent to the vsyscall reserved address.
> > >
> > > Something like this (not yet tested)? Maybe we could also remove the
> > > resume_play_dead() hack?
> >
> > Does it also solve cpuidle case? I have no overview what all the cpuidle
> > drivers might be potentially doing in their ->enter_dead() callbacks.
> > Rafael?
>
> There are just two of them, ACPI cpuidle and intel_idle, and they both should
> be covered.
>
> In any case, I think that this is the way to go here even though it may be somewhat
> problematic to start with.
>

Given that there seems to be a genuine compatibility issue right now,
can we design an actual sane way to hand off control of all CPUs
rather than adding duct tape to an extremely fragile mechanism? I can
think of at least two sensible solutions:

1. Have a self-contained "play dead for kexec/resume" function that
touches only a few well-defined physical pages: a set of page tables and
a page of code. Load CR3 to point to those page tables, fill in the
code with some form of infinite loop, and run it. Or just turn off
paging entirely and run the infinite loop. Have the kernel doing the
resuming inform the kernel being resumed of which pages these are, and
have the kernel being resumed take over all CPUs before reusing the
pages.

2. Put the CPU all the way to sleep by sending it an INIT IPI.

Version 2 seems very simple and robust. Is there a reason we can't do
it? We obviously don't want to do it for normal offline because it
might be a high-power state, but a CPU in the wait-for-SIPI state is
not going to exit that state all by itself.

The patch to implement #2 should be short and sweet as long as we are
careful to only put genuine APs to sleep like this. The only downside
I can see is that a new kernel resuming an old kernel that was
booted with nosmt is going to waste power, but I don't think that's a
showstopper.
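
For reference, a rough sketch of what the INIT IPI in #2 looks like at
the register level, assuming xAPIC mode (8-bit APIC IDs, destination in
the top byte of the high ICR word). The macro and function names below
are made up for illustration; only the bit positions come from the
xAPIC ICR layout. A real patch would presumably use the kernel's
existing APIC helpers rather than anything like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the xAPIC ICR layout. */
#define ICR_DM_INIT       (UINT32_C(5) << 8)   /* delivery mode 101b: INIT */
#define ICR_LEVEL_ASSERT  (UINT32_C(1) << 14)  /* level: assert */
#define ICR_TRIG_LEVEL    (UINT32_C(1) << 15)  /* trigger mode: level */

/* High ICR word: physical destination APIC ID in bits 24..31. */
static inline uint32_t icr_high_for(uint32_t apic_id)
{
	assert(apic_id <= 0xFF);	/* xAPIC IDs are 8 bits wide */
	return apic_id << 24;
}

/* Low ICR word for an INIT IPI with level asserted. */
static inline uint32_t icr_low_init_assert(void)
{
	return ICR_TRIG_LEVEL | ICR_LEVEL_ASSERT | ICR_DM_INIT;
}
```

The sender would write icr_high_for(id) to the high ICR word first and
icr_low_init_assert() to the low word second, since the write to the
low word is what actually sends the IPI. The target then sits in
wait-for-SIPI until some kernel sends it a startup IPI.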