Re: [PATCH 1/2] x86/mm: Reinitialize TLB state on hotplug and resume

From: Andy Lutomirski
Date: Thu Sep 07 2017 - 21:23:38 EST

> On Sep 7, 2017, at 12:55 PM, Jiri Kosina <jikos@xxxxxxxxxx> wrote:
>
> On Thu, 7 Sep 2017, Ingo Molnar wrote:
>
>>>> When Linux brings a CPU down and back up, it switches to init_mm and then
>>>> loads swapper_pg_dir into CR3. With PCID enabled, this has the side effect
>>>> of masking off the ASID bits in CR3.
>>>>
>>>> This can result in some confusion in the TLB handling code. If we
>>>> bring a CPU down and back up with any ASID other than 0, we end up
>>>> with the wrong ASID active on the CPU after resume. This could
>>>> cause our internal state to become corrupt, although major
>>>> corruption is unlikely because init_mm doesn't have any user pages.
>>>> More obviously, if CONFIG_DEBUG_VM=y, we'll trip over an assertion
>>>> in the next context switch. The result of *that* is a failure to
>>>> resume from suspend with probability 1 - 1/6^(cpus-1).
>>>>
>>>> Fix it by reinitializing cpu_tlbstate on resume and CPU bringup.
>>>>
>>>> Reported-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>>>> Reported-by: Jiri Kosina <jikos@xxxxxxxxxx>
>>>> Fixes: 10af6235e0d3 ("x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID")
>>>> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
>>>
>>> Tested-by: Jiri Kosina <jkosina@xxxxxxx>
>>
>> The fix should be upstream already, as of 1c9fe4409ce3 and later.
>
> Hm, so I've just experienced two reboots in a row just after reading the
> hibernation image (i.e. exactly the same symptom as before), even with the
> 3b9f8ed kernel (which contains the fix). Seems like the fix is either
> incomplete (just the probability of it happening is lower), or I'm seeing
> something different with the same symptom.
>
> I'll try to figure out whether it's the same VM_BUG_ON() triggering, but
> I probably won't be able to do so until tomorrow.
>

Nah, don't waste your time. I think I see the bug, and it's a different bug. It's an easy one-line fix, but I have to figure out how to test it.

> --
> Jiri Kosina
> SUSE Labs
>