Re: frequent lockups in 3.18rc4

From: Andy Lutomirski
Date: Thu Nov 20 2014 - 01:17:20 EST


On Wed, Nov 19, 2014 at 6:42 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Nov 19, 2014 at 5:16 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>
>> And you were calling me crazy? :)
>
> Hey, I'm crazy like a fox.
>
>> We could be restarting just about anything if that happens. Except
>> that if we double-faulted on a trap gate entry instead of an interrupt
>> gate entry, then we can't restart, and, unless we can somehow decode
>> the error code usefully (it's woefully undocumented), int 0x80 and
>> int3 might be impossible to handle correctly if they double-fault. And
>> please don't suggest moving int 0x80 to an IST stack :)
>
> No, no. So tell me if this won't work:
>
> - when forking a new process, make sure we allocate the vmalloc stack
> *before* we copy the vm
>
> - this should guarantee that all new processes will at least have their
> *own* stack always in their page tables, since vmalloc always fills in
> the missing entry in the current page tables of the thread doing the
> vmalloc.

This gets interesting for kernel threads that don't really have an mm
in the first place, though.
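
Roughly, the ordering you're proposing would look something like this.
Just a sketch, not the real fork path -- copy_child_mm() is a hypothetical
stand-in for the mm-duplication step:

#include <linux/sched.h>
#include <linux/vmalloc.h>
#include <linux/errno.h>

int copy_child_mm(struct task_struct *child);	/* hypothetical helper */

static int fork_stack_then_mm(struct task_struct *child)
{
	/*
	 * Allocate the child's stack with vmalloc *first*: vmalloc
	 * installs the pgd entry in the current (parent's) page
	 * tables...
	 */
	child->stack = vmalloc(THREAD_SIZE);
	if (!child->stack)
		return -ENOMEM;

	/*
	 * ...so by the time the parent's mm is copied here, the copy
	 * already maps the child's own stack.
	 */
	return copy_child_mm(child);
}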

>
> HOWEVER, that leaves the task switch *to* that process, and making
> sure that the stack pointer is ok in between the "switch %rsp" and
> "switch %cr3".
>
> So then we make the rule be: switch %cr3 *before* switching %rsp, and
> only in between those places can we get in trouble. Yes/no?
>

Kernel threads aside, sure. And we do it in this order anyway, I think.
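
Something like this is the shape of it -- a sketch only, with interrupts
off the whole time. load_cr3() is the real helper, but
switch_stack_and_regs() is a made-up name for the %rsp/register switch
that actually lives in the switch_to() assembly:

#include <linux/sched.h>
#include <asm/processor.h>	/* load_cr3() */

void switch_stack_and_regs(struct task_struct *prev,
			   struct task_struct *next);	/* hypothetical */

static void switch_sketch(struct task_struct *prev, struct task_struct *next)
{
	/*
	 * 1: switch %cr3 first.  The new task's stack is guaranteed
	 * to be mapped in next->mm because fork allocated it before
	 * copying the mm.  (Kernel threads with mm == NULL are the
	 * wrinkle noted above.)
	 */
	load_cr3(next->mm->pgd);

	/*
	 * 2: the only dangerous window -- %rsp still points at the
	 * old stack, which may not be mapped in the new page tables.
	 * A touch of it double-faults and gets repaired by syncing
	 * the missing pgd entry.
	 */

	/* 3: switch %rsp and the rest of the register state */
	switch_stack_and_regs(prev, next);
}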

> And that small section is all with interrupts disabled, and nothing
> should take an exception. The C code might take a double fault on a
> regular access to the old stack (the *new* stack is guaranteed to be
> mapped, but the old stack is not), but that should be very similar to
> what we already do with "iret". So we can just fill in the page tables
> and return.

Unless we try to dump the stack from an NMI or something, but that
should be fine regardless.

>
> For safety, add a percpu counter that is cleared before the %cr3
> setting, to make sure that we only do a *single* double-fault, but it
> really sounds pretty safe. No?

I wouldn't be surprised if that's just as expensive as fixing up
the pgd in the first place. The fixup is roughly just:

if (unlikely(pgd_none(mm->pgd[pgd_index(rsp)]))) fix it;

or something like that.
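
Fleshed out, the fixup is more or less what vmalloc_fault() already does
for other vmalloc areas: copy the missing pgd entry from the reference
(init_mm) page tables. A sketch -- sync_stack_pgd() is a made-up name:

#include <linux/sched.h>
#include <linux/mm.h>
#include <asm/pgtable.h>

static int sync_stack_pgd(unsigned long rsp)
{
	pgd_t *pgd     = pgd_offset(current->active_mm, rsp);
	pgd_t *pgd_ref = pgd_offset_k(rsp);	/* init_mm's entry */

	if (pgd_none(*pgd_ref))
		return -1;		/* nothing to copy from */

	if (pgd_none(*pgd))
		set_pgd(pgd, *pgd_ref);	/* fill in the missing entry */

	return 0;
}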

>
> The only deadly thing would be NMI, but that's an IST anyway, so not
> an issue. No other traps should be able to happen except the double
> page table miss.
>
> But hey, maybe I'm not crazy like a fox. Maybe I'm just plain crazy,
> and I missed something else.

I actually kind of like it, other than the kernel thread issue.

We should arguably ditch lazy mm for kernel threads in favor of PCID,
but that's a different story. Or we could beg Intel to give us
separate kernel and user page table hierarchies.

--Andy

>
> And no, I don't think the above is necessarily a *good* idea. But it
> doesn't seem really overly complicated either.
>
> Linus



--
Andy Lutomirski
AMA Capital Management, LLC