Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

From: Brian Gerst
Date: Sun Mar 06 2016 - 12:30:40 EST


On Sun, Mar 6, 2016 at 11:55 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
> On Sat, Mar 05, 2016 at 09:52:20PM -0800, Andy Lutomirski wrote:
>> Due to a blatant design error, SYSENTER doesn't clear TF. As a result,
>> if a user does SYSENTER with TF set, we will single-step through the
>> kernel until something clears TF. There is absolutely nothing we can
>> do to prevent this short of turning off SYSENTER [1].
>>
>> Simplify the handling considerably with two changes:
>>
>> 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
>> add TF to that list of flags to sanitize with no overhead whatsoever.
>>
>> 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue.
>>
>> That's all we need to do.
>>
>> Don't get too excited -- our handling is still buggy on 32-bit
>> kernels. There's nothing wrong with the SYSENTER code itself, but
>> the #DB prologue has a clever fixup for traps on the very first
>> instruction of entry_SYSENTER_32, and the fixup doesn't work quite
>> correctly. The next two patches will fix that.
>>
>> [1] We could probably prevent it by forcing BTF on at all times and
>> making sure we clear TF before any branches in the SYSENTER
>> code. Needless to say, this is a bad idea.
>>
>> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
>> ---
>> arch/x86/entry/entry_32.S | 42 ++++++++++++++++++++++----------
>> arch/x86/entry/entry_64_compat.S | 9 ++++++-
>> arch/x86/include/asm/proto.h | 15 ++++++++++--
>> arch/x86/kernel/traps.c | 52 +++++++++++++++++++++++++++++++++-------
>> 4 files changed, 94 insertions(+), 24 deletions(-)
>>
>> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
>> index a8c3424c3392..7700cf695d8c 100644
>> --- a/arch/x86/entry/entry_32.S
>> +++ b/arch/x86/entry/entry_32.S
>> @@ -287,7 +287,26 @@ need_resched:
>> END(resume_kernel)
>> #endif
>>
>> - # SYSENTER call handler stub
>> +GLOBAL(__begin_SYSENTER_singlestep_region)
>> +/*
>> + * All code from here through __end_SYSENTER_singlestep_region is subject
>> + * to being single-stepped if a user program sets TF and executes SYSENTER.
>> + * There is absolutely nothing that we can do to prevent this from happening
>> + * (thanks Intel!). To keep our handling of this situation as simple as
>> + * possible, we handle TF just like AC and NT, except that our #DB handler
>> + * will ignore all of the single-step traps generated in this range.
>> + */
>> +
>> +#ifdef CONFIG_XEN
>> +/*
>> + * Xen doesn't set %esp to be precisely what the normal SYSENTER
>> + * entry point expects, so fix it up before using the normal path.
>> + */
>> +ENTRY(xen_sysenter_target)
>> + addl $5*4, %esp /* remove xen-provided frame */
>> + jmp sysenter_past_esp
>> +#endif
>> +
>> ENTRY(entry_SYSENTER_32)
>> movl TSS_sysenter_sp0(%esp), %esp
>> sysenter_past_esp:
>
> Can you do this below (ontop of your diff) and get rid of those
> __{begin,end}_SYSENTER_singlestep_region and __end_entry_SYSENTER_compat
> globals and use the entry_SYSENTER_{32|compat} symbols instead?
>
> From a quick scan, kallsyms can give you the symbol size too so that you
> can compute where it ends:
>
> readelf -a vmlinux | grep entry_SYSENTER
> 55454: ffffffff8170aff0 99 FUNC GLOBAL DEFAULT 1 entry_SYSENTER_compat
> 62596: ffffffff8170b053 0 NOTYPE GLOBAL DEFAULT 1 __end_entry_SYSENTER_comp
>
> 0xffffffff8170aff0 + 99 = 0xffffffff8170b053
>
> ---
> Index: b/arch/x86/entry/entry_32.S
> ===================================================================
> --- a/arch/x86/entry/entry_32.S 2016-03-06 17:47:10.059733163 +0100
> +++ b/arch/x86/entry/entry_32.S 2016-03-06 17:46:52.235733410 +0100
> @@ -297,18 +297,13 @@ GLOBAL(__begin_SYSENTER_singlestep_regio
> * will ignore all of the single-step traps generated in this range.
> */
>
> +ENTRY(entry_SYSENTER_32)
> #ifdef CONFIG_XEN
> -/*
> - * Xen doesn't set %esp to be precisely what the normal SYSENTER
> - * entry point expects, so fix it up before using the normal path.
> - */
> -ENTRY(xen_sysenter_target)
> addl $5*4, %esp /* remove xen-provided frame */
> jmp sysenter_past_esp
> -#endif
> -
> -ENTRY(entry_SYSENTER_32)
> +#else
> movl TSS_sysenter_sp0(%esp), %esp
> +#endif
> sysenter_past_esp:
> pushl $__USER_DS /* pt_regs->ss */
> pushl %ebp /* pt_regs->sp (stashed in bp) */


This won't work if you run a Xen enabled kernel on bare metal. This
would work though:

ALTERNATIVE "movl TSS_sysenter_sp0(%esp), %esp", "addl $5*4, %esp",
X86_FEATURE_XENPV

I haven't read the Xen hypervisor code, but what are those 5 words
that were pushed on the stack by the hypervisor? It suspiciously is
the size of an IRET frame. Considering that we don't use SYSEXIT on
Xen anymore, can we just redirect SYSENTER to the INT80 handler?
Perhaps even just disable SYSENTER support in the VDSO on Xen. I
can't imagine SYSENTER is any faster than INT80 on Xen, because it has
to trap to the hypervisor first.

--
Brian Gerst