Re: PANIC: double fault, error_code: 0x0 in 4.0.0-rc3-2, kvm related?

From: Stefan Seyfried
Date: Wed Mar 18 2015 - 17:42:10 EST


On 18.03.2015 at 22:21, Andy Lutomirski wrote:
> On Wed, Mar 18, 2015 at 2:12 PM, Stefan Seyfried
> <stefan.seyfried@xxxxxxxxxxxxxx> wrote:
>> On 18.03.2015 at 21:51, Andy Lutomirski wrote:
>>> On Wed, Mar 18, 2015 at 1:05 PM, Stefan Seyfried
>>> <stefan.seyfried@xxxxxxxxxxxxxx> wrote:
>>
>>>>> The relevant thread's stack is here (see ti in the trace):
>>>>>
>>>>> ffff8801013d4000
>>>>>
>>>>> It could be interesting to see what's there.
>>>>>
>>>>> I don't suppose you want to try to walk the paging structures to see
>>>>> if ffff88023bc80000 (i.e. gsbase) and, more specifically,
>>>>> ffff88023bc80000 + old_rsp and ffff88023bc80000 + kernel_stack are
>>>>> present? You'd only have to walk one level -- presumably, if the PGD
>>>>> entry is there, the rest of the entries are okay, too.
>>>>
>>>> That's all greek to me :-)
>>>>
>>>> I see that there is something at ffff88023bc80000:
>>>>
>>>> crash> x /64xg 0xffff88023bc80000
>>>> 0xffff88023bc80000: 0x0000000000000000 0x0000000000000000
>>>> 0xffff88023bc80010: 0x0000000000000000 0x0000000000000000
>>>> 0xffff88023bc80020: 0x0000000000000000 0x000000006686ada9
>>>> 0xffff88023bc80030: 0x0000000000000000 0x0000000000000000
>>>> 0xffff88023bc80040: 0x0000000000000000 0x0000000000000000
>>>> [all zeroes]
>>>> 0xffff88023bc801f0: 0x0000000000000000 0x0000000000000000
>>>>
>>>> old_rsp and kernel_stack seem bogus:
>>>> crash> print old_rsp
>>>> Cannot access memory at address 0xa200
>>>> gdb: gdb request failed: print old_rsp
>>>> crash> print kernel_stack
>>>> Cannot access memory at address 0xaa48
>>>> gdb: gdb request failed: print kernel_stack
>>>>
>>>> kernel_stack is not a pointer? So 0xffff88023bc80000 + 0xaa48 it is:
>>>
>>> Yup. old_rsp and kernel_stack are offsets relative to gsbase.
>>>
>>>>
>>>> crash> x /64xg 0xffff88023bc8aa00
>>>> 0xffff88023bc8aa00: 0x0000000000000000 0x0000000000000000
>>>
>>> [...]
>>>
>>> I don't know enough about crashkernel to know whether the fact that
>>> this worked means anything.
>>
>> AFAIK this just means that the memory at this location is included in
>> the dump :-)
>>
>>> Can you dump the page of physical memory at 0x4779a067? That's the PGD.
>>
>> Unfortunately not, this is a partial dump (I think the default config in
>> openSUSE, but I might have changed it some time ago) and the dump_level
>> is 31 which means that the following are excluded:
>>
>>        |      |cache  |cache  |      |
>>  dump  | zero |without|with   | user | free
>>  level | page |private|private| data | page
>> -------+------+-------+-------+------+------
>>    31  |  X   |   X   |   X   |  X   |  X
>>
>> so this:
>> crash> x /64xg 0x4779a067
>> 0x4779a067: Cannot access memory at address 0x4779a067
>> gdb: gdb request failed: x /64xg
>>
>> probably just means that the PGD falls into one of the excluded
>> categories above.
>
> I suspect that it actually means that gdb sees virtual addresses, not
> physical addresses. But I screwed up completely -- "PGD" in the dump
> is the PGD *entry*, not the PGD pointer.

In crash, physical addresses usually work (it's a sophisticated wrapper
around gdb, AFAICT).
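
To illustrate the entry-versus-pointer distinction: a value like
0x4779a067 is not page-aligned, which fits it being a table entry rather
than an address to dump. A minimal sketch of decoding it, assuming the
standard x86_64 page-table entry layout (physical address in bits 12..51,
flags in the low bits); the names ADDR_MASK, FLAG_BITS and decode_entry
are made up for this example:

```python
# Sketch only: assumes the usual x86_64 page-table entry layout.
ADDR_MASK = 0x000FFFFFFFFFF000  # bits 12..51: physical address of the next table

# Low flag bits, keyed by bit position. On a non-leaf entry such as a
# PGD entry, the "dirty" bit is not meaningful to the hardware.
FLAG_BITS = {
    0: "present",
    1: "writable",
    2: "user",
    3: "write-through",
    4: "cache-disable",
    5: "accessed",
    6: "dirty",
}


def decode_entry(entry):
    """Split a page-table entry into (physical address, list of set flags)."""
    phys = entry & ADDR_MASK
    flags = [name for bit, name in FLAG_BITS.items() if entry & (1 << bit)]
    return phys, flags


phys, flags = decode_entry(0x4779A067)
print(hex(phys), flags)
# prints: 0x4779a000 ['present', 'writable', 'user', 'accessed', 'dirty']
```

So the page this entry points to would start at physical 0x4779a000, and
the trailing 0x067 is only flag bits.
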
>
> We could plausibly fish it out from current->mm, but that's a mess.

I'll come to that below.

> I don't suppose that "info registers" or "p/x $cr3" will show the cr3
> value?

No, that does not work from crash.

But current->mm is easy:
crash> task|grep mm
start_comm =
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
mm = 0xffff8800b8a9c040,
active_mm = 0xffff8800b8a9c040,
comm = "qemu-system-x86",

and (guessing the type :-)
crash> print *(struct mm_struct *)0xffff8800b8a9c040|grep pgd
pgd = 0xffff880002d7e000,

But if that's correct, pgd contains all zeroes:
crash> print *(pgd_t *)0xffff880002d7e000
$15 = {
pgd = 0
}
crash> x /16xg 0xffff880002d7e000
0xffff880002d7e000: 0x0000000000000000 0x0000000000000000
0xffff880002d7e010: 0x0000000000000000 0x0000000000000000
0xffff880002d7e020: 0x0000000000000000 0x0000000000000000
0xffff880002d7e030: 0x0000000000000000 0x0000000000000000
0xffff880002d7e040: 0x0000000000000000 0x0000000000000000
0xffff880002d7e050: 0x0000000000000000 0x0000000000000000
0xffff880002d7e060: 0x0000000000000000 0x0000000000000000
0xffff880002d7e070: 0x0000000000000000 0x0000000000000000
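
One caveat about the dump above: it covers only the first 16 slots of the
table. A rough sketch of locating the slot that would map gsbase, assuming
x86_64 4-level paging (512 eight-byte entries per PGD, indexed by virtual
address bits 39..47); pgd_index and pgd_entry_address are helper names
invented for this example, not crash commands:

```python
# Sketch only: assumes x86_64 4-level paging.
PGDIR_SHIFT = 39     # each PGD slot covers 2**39 bytes of virtual address space
PTRS_PER_PGD = 512   # number of entries in one PGD page
ENTRY_SIZE = 8       # bytes per entry


def pgd_index(vaddr):
    """Return the PGD slot number that covers the virtual address."""
    return (vaddr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)


def pgd_entry_address(pgd_base, vaddr):
    """Return the address of the PGD entry covering vaddr."""
    return pgd_base + pgd_index(vaddr) * ENTRY_SIZE


gsbase = 0xFFFF88023BC80000
pgd = 0xFFFF880002D7E000   # mm->pgd as printed above

print(pgd_index(gsbase))                    # prints: 272
print(hex(pgd_entry_address(pgd, gsbase)))  # prints: 0xffff880002d7e880
```

Under those assumptions the interesting entry sits at offset 0x880 into the
table, well past the 0x80 bytes shown, so the zeroes above say nothing yet
about whether gsbase is mapped.
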

> In any case, Denys is right -- my theory doesn't really hold water on
> non-SMAP systems.

My machine is definitely not new enough to have SMAP :)

Maybe it would be more helpful if Takashi, who can reproduce this more
reliably than I can, took a crash dump, preferably with a lower dump
level, so that there is more to investigate.
I have seen the bug only two or three times in a week or two, which makes
waiting for it to happen a boring experience.
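
For reference, the dump level is a bitmask, which is why 31 excludes every
category in the table quoted above. A small sketch, assuming the bit values
1, 2, 4, 8 and 16 in the table's column order as makedumpfile documents
them; excluded_page_types is a name made up for this example:

```python
# Sketch only: bit values assumed from the makedumpfile dump level table.
DUMP_LEVEL_BITS = {
    1: "zero pages",
    2: "cache pages without private",
    4: "cache pages with private",
    8: "user data",
    16: "free pages",
}


def excluded_page_types(dump_level):
    """List the page categories that a given dump level leaves out."""
    return [name for bit, name in DUMP_LEVEL_BITS.items() if dump_level & bit]


print(excluded_page_types(31))  # all five categories
print(excluded_page_types(1))   # prints: ['zero pages']
```

A level of 0 excludes nothing, at the cost of a much larger dump.
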

Best regards,

Stefan

--
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537