RE: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked from disable_nonboot_cpus
From: Chen, Yu C
Date: Wed Jul 06 2016 - 22:50:28 EST
> -----Original Message-----
> From: Rafael J. Wysocki [mailto:rjw@xxxxxxxxxxxxx]
> Sent: Thursday, July 07, 2016 8:33 AM
> To: Chen, Yu C; James Morse
> Cc: linux-pm@xxxxxxxxxxxxxxx; Thomas Gleixner; H. Peter Anvin; Pavel Machek;
> Borislav Petkov; Peter Zijlstra; Ingo Molnar; Len Brown; x86@xxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH][RFC v3] x86, hotplug: Use hlt instead of mwait if invoked
> from disable_nonboot_cpus
>
> On Tuesday, June 28, 2016 05:16:43 PM Chen Yu wrote:
> > Stress test from Varun Koyyalagunta reports that, the nonboot CPU
> > would hang occasionally, when resuming from hibernation. Further
> > investigation shows that, the precise stage when nonboot CPU hangs, is
> > the time when the nonboot CPU been woken up incorrectly, and tries to
> > monitor the mwait_ptr for the second time, then an exception is
> > triggered due to illegal vaddr access, say, something like, 'Unable to
> > handler kernel address of 0xffff8800ba800010...'
> >
> > Further investigation shows that, this exception is caused by
> > accessing a page without PRESENT flag, because the pte entry for this
> > vaddr is zero. Here's the scenario how this problem
> > happens: Page table for direct mapping is allocated dynamically by
> > kernel_physical_mapping_init, it is possible that in the resume
> > process, when the boot CPU is trying to write back pages to their
> > original address, and just right to writes to the monitor mwait_ptr
> > then wakes up one of the nonboot CPUs, since the page table currently
> > used by the nonboot CPU might not the same as it is before the
> > hibernation, an exception might occur due to inconsistent page table.
> >
> > First try is to get rid of this problem by changing the monitor
> > address from task.flag to zero page, because no one would write data
> > to zero page. But there is still problem because of a ping-pong wake
> > up scenario in mwait_play_dead:
> >
> > One possible implementation of a clflush is a read-invalidate snoop,
> > which is what a store might look like, so cflush might break the mwait.
> >
> > 1. CPU1 wait at zero page
> > 2. CPU2 cflush zero page, wake CPU1 up, then CPU2 waits at zero page
> > 3. CPU1 is woken up, and invoke cflush zero page, thus wake up CPU2 again.
> > then the nonboot CPUs never sleep for long.
> >
> > So it's better to monitor different address for each nonboot CPUs,
> > however since there is only one zero page, at most:
> > PAGE_SIZE/L1_CACHE_LINE CPUs are satisfied, which is usually 64 on a
> > x86_64, apparently it's not enough for servers, maybe more zero pages
> > are required.
> >
> > So choose a new solution as Brian suggested, to put the nonboot CPUs
> > into hlt before resume, without touching any memory during s/r.
> > Theoretically there might still be some problems if some of the CPUs
> > have already been put offline, but since the case is very rare and
> > users can work around it, we do not deal with this special case in
> > kernel for now.
> >
> > BTW, as James mentioned, he might want to encapsulate
> > disable_nonboot_cpus into arch-specific, so this patch might need small
> change after that.
> >
> > Comments and suggestions would be appreciated.
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
> > Reported-and-tested-by: Varun Koyyalagunta <cpudebug@xxxxxxxxxxxx>
> > Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
>
> Below is my sort of version of this (untested) and I did it this way, because the
> issue is specific to resume from hibernation (the workaround need not be
> applied anywhere else) and the hibernate_resume_nonboot_cpu_disable()
> thing may be useful to arm64 too if I'm not mistaken (James?).
James might want a flag to distinguish whether it is from suspend or resume,
in his arch-specific disabled_nonboot_cpus?
and this patch works on my xeon.
Tested-by: Chen Yu <yu.c.chen@xxxxxxxxx>
>
> Actually, if arm64 uses it too, the __weak implementation can be dropped,
> because it will be possible to make it depend on ARCH_HIBERNATION_HEADER
> (x86 and arm64 are the only users of that).
>
> Thanks,
> Rafael
>