Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

From: Andy Lutomirski
Date: Wed Mar 18 2015 - 17:56:23 EST

Next message: Suman Anna: "Re: [PATCH v8 0/4] hwspinlock core & omap dt support"
Previous message: Scott Branden: "[PATCH v2] dt-bindings: brcm: rationalize Broadcom documentation naming"
In reply to: Denys Vlasenko: "Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?"
Next in thread: Denys Vlasenko: "Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?"

On Wed, Mar 18, 2015 at 2:42 PM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
> On 03/18/2015 10:32 PM, Linus Torvalds wrote:
>> On Wed, Mar 18, 2015 at 12:26 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>
>>>> crash> disassemble page_fault
>>>> Dump of assembler code for function page_fault:
>>>> 0xffffffff816834a0 <+0>: data32 xchg %ax,%ax
>>>> 0xffffffff816834a3 <+3>: data32 xchg %ax,%ax
>>>> 0xffffffff816834a6 <+6>: data32 xchg %ax,%ax
>>>> 0xffffffff816834a9 <+9>: sub $0x78,%rsp
>>>> 0xffffffff816834ad <+13>: callq 0xffffffff81683620 <error_entry>
>>>
>>> The callq was the double-faulting instruction, and it is indeed the
>>> first function in here that would have accessed the stack. (The sub
>>> *changes* rsp but isn't a memory access.) So, since RSP is bogus, we
>>> page fault, and the page fault is promoted to a double fault. The
>>> surprising thing is that the page fault itself seems to have been
>>> delivered okay, and RSP wasn't on a page boundary.
>>
>> Not at all surprising, and sure it was on a page boundary.
>>
>> Look closer.
>>
>> %rsp is 00007fffa55eafb8.
>>
>> But that's *after* page_fault has done that
>>
>> sub $0x78,%rsp
>>
>> so %rsp when the page fault happened was 0x7fffa55eb030. Which is a
>> different page.

Ah, I forgot to add 0x78. You're right, of course.
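
For anyone following along, the arithmetic can be checked with a short
sketch. It assumes 4 KiB pages; the helper name page_base is made up for
this illustration, and the addresses are the ones quoted above.

```python
PAGE_SIZE = 4096  # assumption: ordinary 4 KiB pages


def page_base(addr):
    """Return the address of the first byte of the page containing addr."""
    return addr & ~(PAGE_SIZE - 1)


# %rsp as reported in the double-fault dump, i.e. after "sub $0x78,%rsp".
rsp_in_dump = 0x00007FFFA55EAFB8

# Undo the sub to get %rsp at the first instruction of page_fault.
rsp_at_entry = rsp_in_dump + 0x78

print(hex(rsp_at_entry))            # 0x7fffa55eb030
print(hex(page_base(rsp_in_dump)))  # 0x7fffa55ea000
print(hex(page_base(rsp_at_entry))) # 0x7fffa55eb000
```

The two page bases differ, so the hardware-pushed frame and the slot that
callq writes to sit on different pages.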

>>
>> And that page happened to be mapped.
>>
>> So what happened is:
>>
>> - we somehow entered kernel mode without switching stacks
>>
>> (ie presumably syscall)
>>
>> - the user stack was still fine
>>
>> - we took a page fault, which once again didn't switch stacks,
>> because we were already in kernel mode. And this page fault worked,
>> because it just pushed the error code onto the user stack which was
>> mapped.
>>
>> - we now took a second page fault within the page fault handler,
>> because now the stack pointer has been decremented and points one user
>> page down that is *not* mapped, so now that page fault cannot push the
>> error code and return information.
>>
>> Now, how we took that original page fault is sadly not very clear at
>> all. I agree that it's something about system-call (how could we not
>> change stacks otherwise), but why it should have started now, I don't
>> know. I don't think "system_call" has changed at all.
>>
>> Maybe there is something wrong with the new "ret_from_sys_call" logic,
>> and that "use sysret to return to user mode" thing. Because this code
>> sequence:
>>
>> + movq (RSP-RIP)(%rsp),%rsp
>> + USERGS_SYSRET64
>>
>> in 'irq_return_via_sysret' is new to 4.0, and instead of entering the
>> kernel with a user stack pointer, maybe we're *exiting* the kernel,
>> and have just reloaded the user stack pointer when "USERGS_SYSRET64"
>> takes some fault.
>
> Yes, so far we happily thought that SYSRET never fails...
>
> This merits adding some code which would at least BUG_ON
> if the faulting address is seen to match SYSRET64.

sysret64 can only fail with #GP, and we're totally screwed if that
happens, although I agree about the BUG_ON in principle. Where would
we add it that would help in this case, though? We never even made it
to C code.

In any event, this was a page fault. sysret64 doesn't access memory.
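
To make the sequence Linus describes concrete, here is a toy model of it.
This is only a sketch under stated assumptions, not kernel code: it
assumes 4 KiB pages, that #PF does not use an IST stack, and that the
only mapped stack page is the one at 0x7fffa55eb000. The function names
and the starting RSP of 0x7fffa55eb060 are hypothetical; that value is
picked only because it yields the 0x7fffa55eb030 seen at entry.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
HW_FRAME = 6 * 8   # SS, RSP, RFLAGS, CS, RIP, error code (64-bit mode)


class DoubleFault(Exception):
    """Raised when a fault occurs while delivering a page fault."""


def is_mapped(lo, hi, mapped_pages):
    """True if every byte in [lo, hi) lies on a page in mapped_pages."""
    first, last = lo // PAGE_SIZE, (hi - 1) // PAGE_SIZE
    return all(page in mapped_pages for page in range(first, last + 1))


def deliver_page_fault(rsp, cpl, kernel_stack_top, mapped_pages):
    """Model the CPU pushing a #PF frame; return RSP at handler entry."""
    if cpl == 3:
        rsp = kernel_stack_top  # stack switch only when coming from user mode
    rsp &= ~0xF                 # 64-bit mode aligns RSP before pushing
    new_rsp = rsp - HW_FRAME
    if not is_mapped(new_rsp, rsp, mapped_pages):
        raise DoubleFault("cannot push #PF frame at %#x" % new_rsp)
    return new_rsp


def page_fault_entry(rsp, mapped_pages):
    """Model the first two instructions of page_fault at CPL 0."""
    rsp -= 0x78                 # sub $0x78,%rsp: no memory access
    ret_slot = rsp - 8          # callq error_entry writes its return address
    if not is_mapped(ret_slot, rsp, mapped_pages):
        # Second #PF, still at CPL 0, so it is delivered on the same stack.
        deliver_page_fault(rsp, 0, None, mapped_pages)
    return ret_slot


mapped = {0x7FFFA55EB000 // PAGE_SIZE}   # only the upper user page is mapped
entry_rsp = deliver_page_fault(0x7FFFA55EB060, 0, None, mapped)
print(hex(entry_rsp))                    # first #PF is delivered fine
try:
    page_fault_entry(entry_rsp, mapped)
except DoubleFault as exc:
    print("double fault:", exc)
```

In this model the first fault is delivered because its frame fits on the
mapped page, and the callq then faults on the page below, which is where
the delivery fails.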

>
> Now we only check for faulting IRETQ:
>
> error_kernelspace:
> CFI_REL_OFFSET rcx, RCX+8
> incl %ebx
> leaq native_irq_return_iret(%rip),%rcx
> cmpq %rcx,RIP+8(%rsp)
> je error_bad_iret
>
>>
>> Is PARAVIRT enabled? The three nops at the beginning of 'page_fault'
>> make me suspect it is, and that that is some paravirt rewriting
>> area. What does paravirt go for that USERGS_SYSRET64 (or for
>> SWAPGS_UNSAFE_STACK, for that matter)?

On Xen, it goes to xen_sysret64, which touches the same percpu
variables that we touch on entry. So I still like my percpu vmap
fault hypothesis, even though I don't understand what would trigger
it.

At the risk of asking awful questions, what happens if we deliver an
IST interrupt in vmx_handle_external_intr? Can that happen? It can't
be a good thing if it happens.

--Andy
