Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline

From: Andy Lutomirski
Date: Sat Sep 08 2018 - 00:36:22 EST


On Fri, Sep 7, 2018 at 9:40 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
>> The SYSCALL64 trampoline has a couple of nice properties:
>>
>> - The usual sequence of SWAPGS followed by two GS-relative accesses to
>> set up RSP is somewhat slow because the GS-relative accesses need
>> to wait for SWAPGS to finish. The trampoline approach allows
>> RIP-relative accesses to set up RSP, which avoids the stall.
>>
>> - The trampoline avoids any percpu access before CR3 is set up,
>> which means that no percpu memory needs to be mapped in the user
>> page tables. This prevents using Meltdown to read any percpu memory
>> outside the cpu_entry_area and prevents using timing leaks
>> to directly locate the percpu areas.
>>
>> The downsides of using a trampoline may outweigh the upsides, however.
>> It adds an extra non-contiguous I$ cache line to system calls, and it
>> forces an indirect jump to transfer control back to the normal kernel
>> text after CR3 is set up. The latter is because x86 lacks a 64-bit
>> direct jump instruction that could jump from the trampoline to the entry
>> text. With retpolines enabled, the indirect jump is extremely slow.
>>
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless. It
>> does allow a timing attack to locate the percpu area, but KASLR is
>> more or less a lost cause against local attacks on CPUs vulnerable to
>> Meltdown anyway. As far as I'm concerned, on current hardware,
>> KASLR is only useful to mitigate remote attacks that try to attack
>> the kernel without first gaining RCE against a vulnerable user
>> process.
>>
>> On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces
>> syscall overhead from ~237ns to ~228ns.
>>
>> There is a possible alternative approach: we could instead move the
>> trampoline within 2G of the entry text and make a separate copy for
>> each CPU. Then we could use a direct jump to rejoin the normal
>> entry path.
>>
>> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
>
> The following commit should also be reverted:
>
> 4d99e4136580 ("perf machine: Workaround missing maps for x86 PTI entry trampolines")

I don't think we should revert it, since the perf folks want new perf
versions to be fully functional on older kernels. My plan was to let
the perf crew do whatever reverts they feel are appropriate.