Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

From: Linus Torvalds
Date: Wed Mar 18 2015 - 17:32:20 EST


On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>
>> crash> disassemble page_fault
>> Dump of assembler code for function page_fault:
>> 0xffffffff816834a0 <+0>: data32 xchg %ax,%ax
>> 0xffffffff816834a3 <+3>: data32 xchg %ax,%ax
>> 0xffffffff816834a6 <+6>: data32 xchg %ax,%ax
>> 0xffffffff816834a9 <+9>: sub $0x78,%rsp
>> 0xffffffff816834ad <+13>: callq 0xffffffff81683620 <error_entry>
>
> The callq was the double-faulting instruction, and it is indeed the
> first function in here that would have accessed the stack. (The sub
> *changes* rsp but isn't a memory access.) So, since RSP is bogus, we
> page fault, and the page fault is promoted to a double fault. The
> surprising thing is that the page fault itself seems to have been
> delivered okay, and RSP wasn't on a page boundary.

Not at all surprising, and it sure was on a page boundary..

Look closer.

%rsp is 00007fffa55eafb8.

But that's *after* page_fault has done that

sub $0x78,%rsp

so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
different page.

And that page happened to be mapped.
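
The address arithmetic above can be checked with a few lines of C. This is only a sketch: the helper names (page_base, rsp_before_sub, same_page) and the PT_REGS_RESERVE constant are made up for illustration, and it assumes ordinary 4 KiB pages for the user stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE       0x1000ULL /* assumption: 4 KiB pages */
#define PT_REGS_RESERVE 0x78ULL   /* the "sub $0x78,%rsp" at page_fault+9 */

/* Start address of the page that contains addr. */
static uint64_t page_base(uint64_t addr)
{
    return addr & ~(PAGE_SIZE - 1);
}

/* Undo the "sub $0x78,%rsp" to get %rsp as it was on entry to page_fault. */
static uint64_t rsp_before_sub(uint64_t reported_rsp)
{
    assert(reported_rsp <= UINT64_MAX - PT_REGS_RESERVE);
    return reported_rsp + PT_REGS_RESERVE;
}

/* True if both addresses fall within the same page. */
static bool same_page(uint64_t a, uint64_t b)
{
    return page_base(a) == page_base(b);
}
```

Feeding it the reported value, rsp_before_sub(0x7fffa55eafb8) gives 0x7fffa55eb030, and same_page() on the two values is false: one lies in the page at 0x7fffa55ea000, the other in the page at 0x7fffa55eb000.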

So what happened is:

- we somehow entered kernel mode without switching stacks
  (ie presumably syscall)

- the user stack was still fine

- we took a page fault, which once again didn't switch stacks,
  because we were already in kernel mode. And this page fault worked,
  because it just pushed the error code onto the user stack which was
  mapped.

- we now took a second page fault within the page fault handler,
  because now the stack pointer has been decremented and points one user
  page down that is *not* mapped, so now that page fault cannot push the
  error code and return information.
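
The sequence above can be replayed with a toy model in C. It is a rough sketch under stated assumptions, not what the hardware or the kernel literally does: every name in it is hypothetical, it pretends exactly one 4 KiB user stack page is mapped, and it models exception delivery in 64-bit mode without a stack switch as "align %rsp down to 16 bytes, then push six 8-byte words (SS, RSP, RFLAGS, CS, RIP, error code)".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE   0x1000ULL /* assumption: 4 KiB pages */
#define FRAME_BYTES 0x30ULL   /* SS, RSP, RFLAGS, CS, RIP, error code */
#define SUB_BYTES   0x78ULL   /* "sub $0x78,%rsp" at page_fault+9 */

/* NESTED_FAULT cannot occur with a single mapped page; kept for clarity. */
enum outcome { HANDLER_RUNS, NESTED_FAULT, DOUBLE_FAULT };

/* Hypothetical memory model: only [mapped_page, mapped_page + 4K) is present. */
static bool range_mapped(uint64_t lo, uint64_t len, uint64_t mapped_page)
{
    return lo >= mapped_page && lo + len <= mapped_page + PAGE_SIZE;
}

/* Deliver an exception on the current stack; false if the frame can't be written. */
static bool deliver_fault(uint64_t *rsp, uint64_t mapped_page)
{
    uint64_t frame = (*rsp & ~0xfULL) - FRAME_BYTES;

    if (!range_mapped(frame, FRAME_BYTES, mapped_page))
        return false;
    *rsp = frame;
    return true;
}

/* Replay: first page fault, "sub $0x78,%rsp", then "callq error_entry". */
static enum outcome replay(uint64_t rsp, uint64_t mapped_page, uint64_t *final_rsp)
{
    assert(final_rsp != NULL);

    /* First page fault: the frame goes onto the (still mapped) user stack. */
    if (!deliver_fault(&rsp, mapped_page)) {
        *final_rsp = rsp;
        return DOUBLE_FAULT;
    }

    rsp -= SUB_BYTES; /* changes %rsp, but touches no memory */

    /* callq pushes an 8-byte return address just below %rsp. */
    if (range_mapped(rsp - 8, 8, mapped_page)) {
        *final_rsp = rsp - 8;
        return HANDLER_RUNS;
    }

    /* The push faulted; the CPU now tries to deliver a second page fault. */
    *final_rsp = rsp;
    return deliver_fault(&rsp, mapped_page) ? NESTED_FAULT : DOUBLE_FAULT;
}
```

Starting from a hypothetical user %rsp of 0x7fffa55eb060 (any value in that 16-byte slot behaves the same) with only the page at 0x7fffa55eb000 mapped, replay() returns DOUBLE_FAULT and leaves *final_rsp at 0x7fffa55eafb8, the value seen in the dump.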

Now, how we took that original page fault is sadly not very clear at
all. I agree that it's something about system-call (how could we not
change stacks otherwise), but why it should have started now, I don't
know. I don't think "system_call" has changed at all.

Maybe there is something wrong with the new "ret_from_sys_call" logic,
and that "use sysret to return to user mode" thing. Because this code
sequence:

+ movq (RSP-RIP)(%rsp),%rsp
+ USERGS_SYSRET64

in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
kernel with a user stack pointer, maybe we're *exiting* the kernel,
and have just reloaded the user stack pointer when "USERGS_SYSRET64"
takes some fault.

Is PARAVIRT enabled? The three nops at the beginning of 'page_fault'
make me suspect it is, and that that is some paravirt rewriting
area. What does paravirt do for that USERGS_SYSRET64 (or for
SWAPGS_UNSAFE_STACK, for that matter)?

Linus