Re: [PATCH V2] x86/entry/64: De-Xen-ify our NMI code further

From: Andy Lutomirski
Date: Mon Jan 25 2021 - 12:55:11 EST



> On Jan 25, 2021, at 9:39 AM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Mon, 25 Jan 2021 15:45:06 +0800
> Lai Jiangshan <jiangshanlai@xxxxxxxxx> wrote:
>
>> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>>
>> The commit 929bacec21478("x86/entry/64: De-Xen-ify our NMI code") simplified
>> the NMI code by changing paravirt code into native code and left a comment
>> about "inspecting RIP instead". But until now, "inspecting RIP instead"
>> has not been made happened and this patch tries to complete it.
>>
>> Comments in the code was from Andy Lutomirski. Thanks!
>>
>> Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>> ---
>> arch/x86/entry/entry_64.S | 44 ++++++++++-----------------------------
>> 1 file changed, 11 insertions(+), 33 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index cad08703c4ad..21f67ea62341 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -1268,32 +1268,14 @@ SYM_CODE_START(asm_exc_nmi)
>> je nested_nmi
>>
>> /*
>> - * Now test if the previous stack was an NMI stack. This covers
>> - * the case where we interrupt an outer NMI after it clears
>> - * "NMI executing" but before IRET. We need to be careful, though:
>> - * there is one case in which RSP could point to the NMI stack
>> - * despite there being no NMI active: naughty userspace controls
>> - * RSP at the very beginning of the SYSCALL targets. We can
>> - * pull a fast one on naughty userspace, though: we program
>> - * SYSCALL to mask DF, so userspace cannot cause DF to be set
>> - * if it controls the kernel's RSP. We set DF before we clear
>> - * "NMI executing".
>> + * Now test if we interrupted an outer NMI that just cleared "NMI
>> + * executing" and is about to IRET. This is a single-instruction
>> + * window. This check does not handle the case in which we get a
>> + * nested interrupt (#MC, #VE, #VC, etc.) after clearing
>> + * "NMI executing" but before the outer NMI executes IRET.
>> */
>> - lea 6*8(%rsp), %rdx
>> - /* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
>> - cmpq %rdx, 4*8(%rsp)
>> - /* If the stack pointer is above the NMI stack, this is a normal NMI */
>> - ja first_nmi
>> -
>> - subq $EXCEPTION_STKSZ, %rdx
>> - cmpq %rdx, 4*8(%rsp)
>> - /* If it is below the NMI stack, it is a normal NMI */
>> - jb first_nmi
>> -
>> - /* Ah, it is within the NMI stack. */
>> -
>> - testb $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
>> - jz first_nmi /* RSP was user controlled. */
>
> So we no longer check to see if the current stack is on the NMI stack.
> Makes sense, since this beginning of the NMI code can not be interrupted,
> as there's no breakpoints or faults that can occur when that happens. The
> $nmi_executing is set in all other locations except for:
>
> repeat_nmi - end_repeat_nmi
> and for the iretq itself (.Lnmi_iret).
>
> Thus, by just checking the nmi_executing variable on the stack along with
> the RIP compared to repeat_nim-end_repeat_nmi + .Lnmi_iret, we can safely
> tell if the NMI is nested or not.
>
> I was working on rewriting the beginning comments to reflect these updates,
> and discovered a possible bug that exists (unrelated to this patch).
>
>> + cmpq $.Lnmi_iret, 8(%rsp)
>> + jne first_nmi
>>
>
> On triggering an NMI from user space, I see the switch to the thread stack
> is done, and "exc_nmi" is called.
>
> The problem I see with this is that exc_nmi is called with the thread
> stack, if it were to take an exception, NMIs would be enabled allowing for
> a nested NMI to run. From what I can tell, I don't see anything stopping
> that NMI from executing over the currently running NMI. That is, this means
> that NMI handlers are now re-entrant.

That was intentional in my part. The C code checks for this condition and handles it, just like it does on 32-bit kernels.