Re: Yet another KPTI regression with 4.14.x series in a VM

From: Andy Lutomirski
Date: Sat Jan 13 2018 - 01:09:05 EST


On Fri, Jan 12, 2018 at 1:51 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> On Fri, 12 Jan 2018, Laura Abbott wrote:
>
> Cc+ Andy
>
> I'm almost crashed out by now. Andy might have an idea. I'll look again
> tomorrow with brain awake.
>
>> On 01/12/2018 10:51 AM, Thomas Gleixner wrote:
>> > On Fri, 12 Jan 2018, Laura Abbott wrote:
>> > > Fedora got a bug report on 4.14.11 of a panic when booting a
>> > > Fedora guest in a CentOS 6 VM, not reproducible with nopti.
>> > > The issue is still present as of 4.14.13 as well. The only
>> > > report is a panic screenshot
>> > > https://bugzilla.redhat.com/show_bug.cgi?id=1532458
>> > >
>> > > I've lost track of all the fixes that have been flying around,
>> > > is this a new issue or has a fix not yet made it to stable?
>> >
>> > Hmm. Looks kinda familiar, but that has been fixed I think even before
>> > 4.4.11. Could you please ask the reported to provide a full console log via
>> > the VM "serial console" ?
>> >
>> > Thanks,
>> >
>> > tglx
>> >
>>

>> [ 2.910025] PANIC: double fault, error_code: 0x0

Probably a stack overflow

>> [ 2.910025] CPU: 1 PID: 56 Comm: modprobe Not tainted
>> 4.14.13-300.fc27.x86_64 #1
>> [ 2.910025] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
>> [ 2.910025] task: ffff891f78dc3c00 task.stack: ffffa6eac0594000
>> [ 2.910025] RIP: 0010:vprintk_default+0x5/0x30
>> [ 2.910025] RSP: 0000:fffffe000002e000 EFLAGS: 00010046
>> [ 2.910025] RAX: 0000000000000000 RBX: fffffe000002e118 RCX:
>> 0000000000000001
>> [ 2.910025] RDX: 0000000000000000 RSI: fffffe000002e018 RDI:
>> ffffffffbe0715a0
>> [ 2.910025] RBP: fffffe000002e008 R08: ffffffffbe0bb565 R09:
>> ffffffffbe07159b
>> [ 2.910025] R10: fffffe000002e080 R11: 0000000000000000 R12:
>> ffffffffbe070fdd
>> [ 2.910025] R13: 0000000000000000 R14: 0000000000000000 R15:
>> 0000000000000000
>> [ 2.910025] FS: 0000000000000000(0000) GS:ffff891f7fd00000(0000)
>> knlGS:0000000000000000
>> [ 2.910025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 2.910025] CR2: fffffe000002dff8 CR3: 0000000078da6000 CR4:
>> 00000000000006e0

CR2 and RSP are consistent with stack overflow, but that's not
terribly important, since...

>> [ 2.910025] Call Trace:
>> [ 2.910025] <ENTRY_TRAMPOLINE>
>> [ 2.910025] ? vprintk_func+0x27/0x60
>> [ 2.910025] printk+0x52/0x6e
>> [ 2.910025] __die+0x6b/0xe0
>> [ 2.910025] die+0x2f/0x50
>> [ 2.910025] do_general_protection+0x149/0x160

This is the real problem here.

>> [ 2.910025] general_protection+0x2c/0x60
>> [ 2.910025] RIP: 0010:swapgs_restore_regs_and_return_to_usermode+0x6f/0x80
>> [ 2.910025] RSP: 0000:fffffe000002e1c8 EFLAGS: 00000006
>> [ 2.910025] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>> 0000000000000000
>> [ 2.910025] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>> 0000000078da7800
>> [ 2.910025] RBP: 0000000000000000 R08: 0000000000000000 R09:
>> 0000000000000000
>> [ 2.910025] R10: 0000000000000000 R11: 0000000000000000 R12:
>> 0000000000000000
>> [ 2.910025] R13: 0000000000000000 R14: 0000000000000000 R15:
>> 0000000000000000
>> [ 2.910025] </ENTRY_TRAMPOLINE>

ffffffff81a00b3e: 65 48 0f b3 3c 25 8e btr %rdi,%gs:0x1e168e
ffffffff81a00b45: 16 1e 00
ffffffff81a00b48: 48 89 c7 mov %rax,%rdi
ffffffff81a00b4b: eb 08 jmp 0xffffffff81a00b55
ffffffff81a00b4d: 48 89 c7 mov %rax,%rdi
ffffffff81a00b50: 48 0f ba ef 3f bts $0x3f,%rdi
ffffffff81a00b55: 48 81 cf 00 18 00 00 or $0x1800,%rdi
ffffffff81a00b5c: 0f 22 df mov %rdi,%cr3 <-- #GP here
ffffffff81a00b5f: 58 pop %rax
ffffffff81a00b60: 5f pop %rdi

The CPU didn't like writing 0x0000000078da7800 to %cr3.

I'm guessing that CR4.PCIDE=0.

Now this is quite a strange value to write to CR3. The 0x800 part
means that we're using the "user" variant of the address space that
would have ASID=0 and the 0x1000 bit being set corresponds to the user
pgdir, but this is nonsense, since the kernel never uses PCID 0 for
user mode. We always start at 1. The only exception is if
X86_FEATURE_PCID is off. But, if X86_FEATURE_PCID is off, then we
shouldn't be setting any PCID bits.

In fact, it looks like this code is totally bogus and has never been
correct at all. Even in:

commit 4b1d5ae3b103eda43f9d0f85c355bb6995b03a30
Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Date: Mon Dec 4 15:07:59 2017 +0100

x86/mm: Use/Fix PCID to optimize user/kernel switches

We have:

.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
mov %cr3, \scratch_reg

ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID

...

.Lwrcr3_\@:
/* Flip the PGD and ASID to the user version */
orq $(PTI_SWITCH_MASK), \scratch_reg
mov \scratch_reg, %cr3
.Lend_\@:

That's bogus. PTI_SWITCH_MASK is 0x1800, which has PCID = 0x800.

This should probably use an alternative to select between 0x1000 and
0x800 depending on X86_FEATURE_PCID or just use an entirely different
label for the !PCID case.

FWIW, this bit in SAVE_AND_SWITCH_TO_KERNEL_CR3

testq $(PTI_SWITCH_MASK), \scratch_reg
jz .Ldone_\@

is a bit silly, too. It's *correct* (I think), but shouldn't that
just be bt $(PTI_SWITCH_PGTABLES_BIT), \scratch_reg, with the obvious
caveat that the headers don't actually define PTI_SWITCH_PGTABLES_BIT?

Was this code *ever* tested with nopcid?