Re: ia32_sysenter_target does not preserve EFLAGS

From: Andy Lutomirski
Date: Fri Mar 27 2015 - 14:38:22 EST


On Mar 27, 2015 7:26 AM, "Denys Vlasenko" <dvlasenk@xxxxxxxxxx> wrote:
>
> Hi,
>
> While running some tests I noticed that EFLAGS
> is not saved across syscalls if I use 32-bit
> userspace, use SYSENTER, and paravirt is active.
>
> Looking at the code, it's actually clear why that happens.
>
> /*
>  * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
>  * IF and VM in rflags are cleared (IOW: interrupts are off).
>  * SYSENTER does not save anything on the stack,
>  * and does not save old rip (!!!) and rflags.
>  */
> ENTRY(ia32_sysenter_target)
>         SWAPGS_UNSAFE_STACK     <============================
>         movq    PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
>         ENABLE_INTERRUPTS(CLBR_NONE)
>
>         movl    %ebp, %ebp
>         movl    %eax, %eax
>         movl    ASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d
>
>         /* Construct struct pt_regs on stack */
>         pushq_cfi $__USER32_DS          /* pt_regs->ss */
>         pushq_cfi %rbp                  /* pt_regs->sp */
>         CFI_REL_OFFSET rsp,0
>         pushfq_cfi                      /* pt_regs->flags */
>
> SWAPGS_UNSAFE_STACK, if it involves paravirt callbacks,
> will change EFLAGS, and it *can't* save/restore them -
> there is no place to save them, since neither the stack nor
> PER_CPU() is usable at that point.
>
> Interestingly, *no one ever complained*!
>
> Apparently, users *don't* depend on arithmetic flags
> surviving across a syscall. They are also okay with the
> DF flag being cleared.
>
> Let's go flag-by-flag.
>
> ID - probably no one depends on it
> VIP,VIF,VM - v86 stuff, not supported in 64bit
> AC - someone probably does use this
> RF - should be cleared to 0
> NT - iret via task gate, not supported in 64bit
> IOPL - usually 00, sys_iopl() can change it
> DF - according to C ABI, should be 0
> IF - should be preserved (but almost always 1)
> TF - should be preserved
> arith flags - probably no one cares
>
> IOW, the only bits to be preserved are AC, IOPL, TF, and _maybe_
> IF.
>
> AC and IOPL are preserved even with this paravirt quirk
> because paravirt hooks do not mangle them.
>
> TF preservation and proper restoration is handled by
> do_debug + syscall_trace_enter_phase2 + iret
> combo.
>
> We unconditionally set IF. This is only a problem for applications
> which use sys_iopl(3), disable IRQs in userspace, and then perform
> syscalls. The set of such apps is probably empty.
> (This "bug" exists even in the non-paravirt case).
>
> So, formally, we have a bug: we do not preserve IF,
> DF and arith flags.
>
> I'm proposing to use this opportunity to amend the syscall ABI
> to say that arith flags are not preserved across syscalls,
> and that DF can be cleared to 0 by syscalls (but can't be set to 1).
> Evidently, this has been broken for some time on some virtualized
> setups and users are okay with it.
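
For reference, the flag-by-flag breakdown quoted above can be written down
as bit masks. This is only a rough sketch of the *proposed* ABI, not kernel
code: the bit positions are the architectural EFLAGS ones, but the FL_*
names, the grouping macros, and eflags_abi_violations() are hypothetical
names made up for illustration (the kernel has its own X86_EFLAGS_*
definitions).

```c
#include <assert.h>
#include <stdint.h>

/* Architectural EFLAGS bits (names here are illustrative, not the kernel's). */
#define FL_CF   (1u << 0)
#define FL_PF   (1u << 2)
#define FL_AF   (1u << 4)
#define FL_ZF   (1u << 6)
#define FL_SF   (1u << 7)
#define FL_TF   (1u << 8)
#define FL_IF   (1u << 9)
#define FL_DF   (1u << 10)
#define FL_OF   (1u << 11)
#define FL_IOPL (3u << 12)
#define FL_AC   (1u << 18)

/* Arithmetic flags: the proposal lets a syscall clobber these. */
#define FL_ARITH (FL_CF | FL_PF | FL_AF | FL_ZF | FL_SF | FL_OF)

/* Assumed grouping, taken from the quoted list: AC, IOPL and TF must survive. */
#define FL_MUST_PRESERVE (FL_AC | FL_IOPL | FL_TF)

static_assert((FL_MUST_PRESERVE & FL_ARITH) == 0,
              "preserved and clobberable sets must not overlap");

/*
 * Hypothetical checker: given EFLAGS before and after a syscall, return
 * the mask of bits that violate the proposed ABI (0 means no violation).
 * IF is deliberately not checked, since its handling is the open question.
 */
static inline uint32_t eflags_abi_violations(uint32_t before, uint32_t after)
{
        uint32_t bad = (before ^ after) & FL_MUST_PRESERVE;

        /* DF may be cleared by a syscall, but must never become set. */
        if (!(before & FL_DF) && (after & FL_DF))
                bad |= FL_DF;

        return bad;
}
```

With this, eflags_abi_violations(0x246, 0x202) is 0 (only ZF and PF
changed), while a syscall that returns with AC flipped or with DF newly
set would be reported.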

I think I'd rather fix it. I want to give x86_64 a sysenter stack
like x86_32's, since AFAICT the only reason that #DB needs to use IST
is that sysenter with TF set is the only way I can see for #DB
to happen with an invalid stack.

Also, Houston, we have a bug, probably rootable, and probably damn
near impossible to exploit without crashing your system:

User does sysenter. We end up in native_irq_enable_sysexit. We do:

swapgs
sti

<-- NMI here can happen on some (all?) cpus, returns successfully
*with interrupts unmasked*

<-- IRQ. Boom

My preferred fix would be to use sysretl instead of sysexit. As far
as I know, there are no 64-bit CPUs at all that don't support sysretl.

--Andy

>
> I'm not sure what to do with the "bug" of forcing IF=1.
> Fix it? Or also declare that syscalls can set IF=1?
> Do you think this is legitimate userspace code?
>
> sys_iopl(3);
> cli;
> syscall();
> /* expects irqs still disabled */
>
> --
> vda