Re: [PATCH v2 07/10] x86/entry: Vastly simplify SYSENTER TF handling

From: Andy Lutomirski
Date: Sun Mar 06 2016 - 12:37:18 EST


On Sun, Mar 6, 2016 at 9:30 AM, Brian Gerst <brgerst@xxxxxxxxx> wrote:
> On Sun, Mar 6, 2016 at 11:55 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
>> On Sat, Mar 05, 2016 at 09:52:20PM -0800, Andy Lutomirski wrote:
>>> Due to a blatant design error, SYSENTER doesn't clear TF. As a result,
>>> if a user does SYSENTER with TF set, we will single-step through the
>>> kernel until something clears TF. There is absolutely nothing we can
>>> do to prevent this short of turning off SYSENTER [1].
>>>
>>> Simplify the handling considerably with two changes:
>>>
>>> 1. We already sanitize EFLAGS in SYSENTER to clear NT and AC. We can
>>> add TF to that list of flags to sanitize with no overhead whatsoever.
>>>
>>> 2. Teach do_debug to ignore single-step traps in the SYSENTER prologue.
>>>
>>> That's all we need to do.
>>>
>>> Don't get too excited -- our handling is still buggy on 32-bit
>>> kernels. There's nothing wrong with the SYSENTER code itself, but
>>> the #DB prologue has a clever fixup for traps on the very first
>>> instruction of entry_SYSENTER_32, and the fixup doesn't work quite
>>> correctly. The next two patches will fix that.
>>>
>>> [1] We could probably prevent it by forcing BTF on at all times and
>>> making sure we clear TF before any branches in the SYSENTER
>>> code. Needless to say, this is a bad idea.
>>>
>>> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
>>> ---
>>> arch/x86/entry/entry_32.S | 42 ++++++++++++++++++++++----------
>>> arch/x86/entry/entry_64_compat.S | 9 ++++++-
>>> arch/x86/include/asm/proto.h | 15 ++++++++++--
>>> arch/x86/kernel/traps.c | 52 +++++++++++++++++++++++++++++++++-------
>>> 4 files changed, 94 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
>>> index a8c3424c3392..7700cf695d8c 100644
>>> --- a/arch/x86/entry/entry_32.S
>>> +++ b/arch/x86/entry/entry_32.S
>>> @@ -287,7 +287,26 @@ need_resched:
>>> END(resume_kernel)
>>> #endif
>>>
>>> - # SYSENTER call handler stub
>>> +GLOBAL(__begin_SYSENTER_singlestep_region)
>>> +/*
>>> + * All code from here through __end_SYSENTER_singlestep_region is subject
>>> + * to being single-stepped if a user program sets TF and executes SYSENTER.
>>> + * There is absolutely nothing that we can do to prevent this from happening
>>> + * (thanks Intel!). To keep our handling of this situation as simple as
>>> + * possible, we handle TF just like AC and NT, except that our #DB handler
>>> + * will ignore all of the single-step traps generated in this range.
>>> + */
>>> +
>>> +#ifdef CONFIG_XEN
>>> +/*
>>> + * Xen doesn't set %esp to be precisely what the normal SYSENTER
>>> + * entry point expects, so fix it up before using the normal path.
>>> + */
>>> +ENTRY(xen_sysenter_target)
>>> + addl $5*4, %esp /* remove xen-provided frame */
>>> + jmp sysenter_past_esp
>>> +#endif
>>> +
>>> ENTRY(entry_SYSENTER_32)
>>> movl TSS_sysenter_sp0(%esp), %esp
>>> sysenter_past_esp:
>>
>> Can you do this below (ontop of your diff) and get rid of those
>> __{begin,end}_SYSENTER_singlestep_region and __end_entry_SYSENTER_compat
>> globals and use the entry_SYSENTER_{32|compat} symbols instead?
>>
>> From a quick scan, kallsyms can give you the symbol size too so that you
>> can compute where it ends:
>>
>> readelf -a vmlinux | grep entry_SYSENTER
>> 55454: ffffffff8170aff0 99 FUNC GLOBAL DEFAULT 1 entry_SYSENTER_compat
>> 62596: ffffffff8170b053 0 NOTYPE GLOBAL DEFAULT 1 __end_entry_SYSENTER_comp
>>
>> 0xffffffff8170aff0 + 99 = 0xffffffff8170b053
>>
>> ---
>> Index: b/arch/x86/entry/entry_32.S
>> ===================================================================
>> --- a/arch/x86/entry/entry_32.S 2016-03-06 17:47:10.059733163 +0100
>> +++ b/arch/x86/entry/entry_32.S 2016-03-06 17:46:52.235733410 +0100
>> @@ -297,18 +297,13 @@ GLOBAL(__begin_SYSENTER_singlestep_regio
>> * will ignore all of the single-step traps generated in this range.
>> */
>>
>> +ENTRY(entry_SYSENTER_32)
>> #ifdef CONFIG_XEN
>> -/*
>> - * Xen doesn't set %esp to be precisely what the normal SYSENTER
>> - * entry point expects, so fix it up before using the normal path.
>> - */
>> -ENTRY(xen_sysenter_target)
>> addl $5*4, %esp /* remove xen-provided frame */
>> jmp sysenter_past_esp
>> -#endif
>> -
>> -ENTRY(entry_SYSENTER_32)
>> +#else
>> movl TSS_sysenter_sp0(%esp), %esp
>> +#endif
>> sysenter_past_esp:
>> pushl $__USER_DS /* pt_regs->ss */
>> pushl %ebp /* pt_regs->sp (stashed in bp) */
>
>
> This won't work if you run a Xen enabled kernel on bare metal. This
> would work though:
>
> ALTERNATIVE "movl TSS_sysenter_sp0(%esp), %esp", "addl $5*4, %esp",
> X86_FEATURE_XENPV

That might work.

>
> I haven't read the Xen hypervisor code, but what are those 5 words
> that were pushed on the stack by the hypervisor? It suspiciously is
> the size of an IRET frame.

I think it is nominally an IRET frame. One might wonder what's in it,
given that the hypervisor doesn't know what the old IP or SP was.

> Considering that we don't use SYSEXIT on
> Xen anymore, can we just redirect SYSENTER to the INT80 handler?
> Perhaps even just disable SYSENTER support in the VDSO on Xen. I
> can't imagine SYSENTER is any faster than INT80 on Xen, because it has
> to trap to the hypervisor first.
>

I think we should leave it enabled -- having user programs on Xen PV
trap into the hypervisor for a system call using SYSENTER will still
be much faster than using INT $0x80 as long as the hypervisor does
something reasonable.

I think that Andrew Cooper has some patches in the works to clean a
bunch of the Xen PV entry prologue stuff up.

--Andy