Re: [PATCH v4 15/29] x86/mm/64: Enable vmapped stacks

From: Brian Gerst
Date: Mon Jun 27 2016 - 13:23:56 EST


On Mon, Jun 27, 2016 at 1:09 PM, Brian Gerst <brgerst@xxxxxxxxx> wrote:
> On Mon, Jun 27, 2016 at 12:35 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Mon, Jun 27, 2016 at 9:17 AM, Brian Gerst <brgerst@xxxxxxxxx> wrote:
>>> On Mon, Jun 27, 2016 at 11:54 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>> On Mon, Jun 27, 2016 at 8:22 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>> On Mon, Jun 27, 2016 at 8:12 AM, Brian Gerst <brgerst@xxxxxxxxx> wrote:
>>>>>> On Mon, Jun 27, 2016 at 11:01 AM, Brian Gerst <brgerst@xxxxxxxxx> wrote:
>>>>>>> On Sun, Jun 26, 2016 at 5:55 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>>>>>>> #ifdef CONFIG_X86_64
>>>>>>>> /* Runs on IST stack */
>>>>>>>> dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>>>>> {
>>>>>>>> static const char str[] = "double fault";
>>>>>>>> struct task_struct *tsk = current;
>>>>>>>> +#ifdef CONFIG_VMAP_STACK
>>>>>>>> + unsigned long cr2;
>>>>>>>> +#endif
>>>>>>>>
>>>>>>>> #ifdef CONFIG_X86_ESPFIX64
>>>>>>>> extern unsigned char native_irq_return_iret[];
>>>>>>>> @@ -332,6 +350,20 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>>>>>>> tsk->thread.error_code = error_code;
>>>>>>>> tsk->thread.trap_nr = X86_TRAP_DF;
>>>>>>>>
>>>>>>>> +#ifdef CONFIG_VMAP_STACK
>>>>>>>> + /*
>>>>>>>> + * If we overflow the stack into a guard page, the CPU will fail
>>>>>>>> + * to deliver #PF and will send #DF instead. CR2 will contain
>>>>>>>> + * the linear address of the second fault, which will be in the
>>>>>>>> + * guard page below the bottom of the stack.
>>>>>>>> + */
>>>>>>>> + cr2 = read_cr2();
>>>>>>>> + if ((unsigned long)tsk->stack - 1 - cr2 < PAGE_SIZE)
>>>>>>>> + handle_stack_overflow(
>>>>>>>> + "kernel stack overflow (double-fault)",
>>>>>>>> + regs, cr2);
>>>>>>>> +#endif
>>>>>>>
>>>>>>> Is there any other way to tell if this was from a page fault? If it
>>>>>>> wasn't a page fault then CR2 is undefined.
>>>>>>
>>>>>> I guess it doesn't really matter, since the fault is fatal either way.
>>>>>> The error message might be incorrect though.
>>>>>>
>>>>>
>>>>> It's at least worth a comment, though. Maybe I should check if
>>>>> regs->rsp is within 40 bytes of the bottom of the stack, too, such
>>>>> that delivery of an inner fault would have double-faulted assuming the
>>>>> inner fault didn't use an IST vector.
>>>>>
>>>>
>>>> How about:
>>>>
>>>> /*
>>>> * If we overflow the stack into a guard page, the CPU will fail
>>>> * to deliver #PF and will send #DF instead. CR2 will contain
>>>> * the linear address of the second fault, which will be in the
>>>> * guard page below the bottom of the stack.
>>>> *
>>>> * We're limited to using heuristics here, since the CPU does
>>>> * not tell us what type of fault failed and, if the first fault
>>>> * wasn't a page fault, CR2 may contain stale garbage. To mostly
>>>> * rule out garbage, we check if the saved RSP is close enough to
>>>> * the bottom of the stack to cause exception delivery to fail.
>>>> * The worst case is 7 stack slots: one for alignment, five for
>>>> * SS..RIP, and one for the error code.
>>>> */
>>>> tsk_stack = (unsigned long)task_stack_page(tsk);
>>>> if (regs->rsp <= tsk_stack + 7*8 && regs->rsp > tsk_stack - PAGE_SIZE) {
>>>> /* A double-fault due to #PF delivery failure is plausible. */
>>>> cr2 = read_cr2();
>>>> if (tsk_stack - 1 - cr2 < PAGE_SIZE)
>>>> handle_stack_overflow(
>>>> "kernel stack overflow (double-fault)",
>>>> regs, cr2);
>>>> }
>>>
>>> I think RSP anywhere in the guard page would be best, since it could
>>> have been decremented by a function prologue into the guard page
>>> before an access that triggers the page fault.
>>>
>>
>> I think that can miss some stack overflows. Suppose that RSP points
>> very close to the bottom of the stack and we take an unrelated fault.
>> The CPU can fail to deliver that fault and we get a double fault
>> instead. But I counted wrong, too. Do you like this version and its
>> explanation?
>>
>> /*
>> * If we overflow the stack into a guard page, the CPU will fail
>> * to deliver #PF and will send #DF instead. Similarly, if we
>> * take any non-IST exception while too close to the bottom of
>> * the stack, the processor will get a page fault while
>> * delivering the exception and will generate a double fault.
>> *
>> * According to the SDM (footnote in 6.15 under "Interrupt 14 -
>> * Page-Fault Exception (#PF):
>> *
>> * Processors update CR2 whenever a page fault is detected. If a
>> * second page fault occurs while an earlier page fault is being
>> * deliv- ered, the faulting linear address of the second fault will
>> * overwrite the contents of CR2 (replacing the previous
>> * address). These updates to CR2 occur even if the page fault
>> * results in a double fault or occurs during the delivery of a
>> * double fault.
>> *
>> * However, if we got here due to a non-page-fault exception while
>> * delivering a non-page-fault exception, CR2 may contain a
>> * stale value.
>> *
>> * As a heuristic: we consider this double fault to be a stack
>> * overflow if CR2 points to the guard page and RSP is either
>> * in the guard page or close enough to the bottom of the stack
>> *
>> * We're limited to using heuristics here, since the CPU does
>> * not tell us what type of fault failed and, if the first fault
>> * wasn't a page fault, CR2 may contain stale garbage. To
>> * mostly rule out garbage, we check if the saved RSP is close
>> * enough to the bottom of the stack to cause exception delivery
>> * to fail. If RSP == tsk_stack + 48 and we take an exception,
>> * the stack is already aligned and there will be enough room
>> * SS, RSP, RFLAGS, CS, RIP, and a possible error code. With
>> * any less space left, exception delivery could fail.
>> */
>> tsk_stack = (unsigned long)task_stack_page(tsk);
>> if (regs->rsp < tsk_stack + 48 && regs->rsp > tsk_stack - PAGE_SIZE) {
>> /* A double-fault due to #PF delivery failure is plausible. */
>> cr2 = read_cr2();
>> if (tsk_stack - 1 - cr2 < PAGE_SIZE)
>> handle_stack_overflow(
>> "kernel stack overflow (double-fault)",
>> regs, cr2);
>> }
>
> Actually I think your quote from the SDM contradicts this. The second
> #PF (when trying to invoke the page fault handler) would update CR2
> with an address in the guard page.

Nevermind, I'm totally misunderstanding this. I guess the real
question is when is RSP updated? During each push or once before or
after the frame is pushed. Checking for the bottom of the valid stack
covers the case when RSP isn't updated until after the whole frame is
pushed successfully.

--
Brian Gerst