Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

From: Denys Vlasenko
Date: Mon Mar 23 2015 - 13:46:55 EST


On 03/23/2015 06:18 PM, Takashi Iwai wrote:
> At Mon, 23 Mar 2015 17:07:15 +0100, Denys Vlasenko wrote:
>>>> I pulled tip tree on top of 4.0-rc5, built with your patch and now
>>>> succeeded to get a better message:
>>>>
>>>> kvm: zapping shadow pages for mmio generation wraparound
>>>> kvm [5126]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
>>>> Exception on user stack 00007ffd22c23ef0: RSP: 0018:00007ffd22c23f28 EFLAGS: 00010006
>>>> RIP: 0010:[<ffffffff8162681d>] [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
>>>> PANIC: double fault, error_code: 0x0
>>>> CPU: 1 PID: 10819 Comm: cc1 Tainted: G W 4.0.0-rc5-debug1+ #2
>>>> Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>>>> task: ffff8800d1b34b10 ti: ffff8800d1b30000 task.ti: ffff8800d1b30000
>>>> RIP: 0010:[<ffffffff8162681d>] [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
>>>> RSP: 0018:00007ffd22c23f28 EFLAGS: 00010006
>>>> RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00000000c0000101
>>>> RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ffd22c23ef0

>> FYI: the disassembly of netlink_attachskb (from "Code:" line) is:
>>
>> 0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
>> 5: 55 push %rbp
>> 6: 48 89 e5 mov %rsp,%rbp
>> 9: 41 56 push %r14
>> b: 41 55 push %r13
>> d: 49 89 d5 mov %rdx,%r13
>> 10: 41 54 push %r12
>> 12: 49 89 f4 mov %rsi,%r12
>> 15: 53 push %rbx
>> 16: 48 89 fb mov %rdi,%rbx
>> 19: 48 83 ec 30 sub $0x30,%rsp
>> 1d: 8b 87 68 01 00 00 mov 0x168(%rdi),%eax
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> 23: 39 87 9c 01 00 00 cmp %eax,0x19c(%rdi)
>> 29: 7c 25 jl 50 <_start+0x50>
>> 2b: 48 8b 87 88 04 00 00 mov 0x488(%rdi),%rax
>>
>> The ^^^^^ instruction is the one which faults. Since you said it
>> consistently happens here, this should be a page fault, not an external
>> hardware interrupt.
>>
>> The code corresponds to the comparison in if():
>>
>> int netlink_attachskb(struct sock *sk, struct sk_buff *skb,
>> long *timeo, struct sock *ssk)
>> {
>> struct netlink_sock *nlk;
>>
>> nlk = nlk_sk(sk);
>>
>> if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||

>>> - Another piece is that the bug happens only when a KVM is running.
>>> The kernel ran without problem over days with similar tasks
>>> (compiling kernel, etc) when no KVM was used.
>>
>> Conceivably, virtualization support in CPUs can have nasty errata.
>> However, you and the other reporter have different CPUs - yours
>> is Ivy Bridge, his CPU is a Penryn.
>>
>> I don't see how KVM helps to trigger this.
>>
>>> - And now I get the trace as above, pointing netlink_attachskb().
>>>
>>> I have a difficulty to imagine how all these pieces fit into a single
>>> picture. Is something already screwed up before that?
>>
>> Well, you will see a tiny bit more info if you change %rdi
>> to, say, %r15 in these two lines in my patch:
>>
>> /* Save bogus RSP value */
>> movq %rsp,%rdi
>> ...
>> push %rdi /* pt_regs->sp */
>>
>> Then original %rdi will be visible in the crash message.
>
> OK, here we go.
>
> kvm: zapping shadow pages for mmio generation wraparound
> kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> Exception on user stack 00007fff1d7e5ec0: RSP: 0018:00007fff1d7e5ef8 EFLAGS: 00010002
> RIP: 0010:[<ffffffff8162681d>] [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
> PANIC: double fault, error_code: 0x0
> CPU: 5 PID: 14285 Comm: fixdep Tainted: G W 4.0.0-rc5-debug1+ #3
> Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
> task: ffff88020ba1c690 ti: ffff880206ba4000 task.ti: ffff880206ba4000
> RIP: 0010:[<ffffffff8162681d>] [<ffffffff8162681d>] netlink_attachskb+0x1d/0x1d0
> RSP: 0018:00007fff1d7e5ef8 EFLAGS: 00010002
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000c0000101
> RDX: 0000000000000000 RSI: 0000000000001ebb RDI: 0000000000000000

Thanks for testing. So the original %rdi was NULL... not very informative.

Notice that every one of your crashes is preceded by

kvm: zapping shadow pages for mmio generation wraparound
kvm [5490]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff

This hints that kvm _is_ somehow responsible.
I'm no expert on kvm, so I need to take a look around that code...