Re: [RFC 2/2] x86/pti/64: Remove the SYSCALL64 entry trampoline

From: Andy Lutomirski
Date: Sun Jul 22 2018 - 16:59:28 EST



> On Jul 22, 2018, at 11:27 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>> On Sun, Jul 22, 2018 at 10:45 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
>
> Me likey.
>
> However:
>
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless.
>
> Afaik, it does now potentially expose through meltdown the per-thread
> entry stack info, which is new.

It's always been exposed through the RO alias. The only new exposure is the *address* of the RW alias, I think.

>
> But I don't think that's a show-stopper.
>
>> static void __init pti_clone_user_shared(void)
>> {
>> + for_each_possible_cpu(cpu) {
>
> But this code is pretty disgusting and seems wrong.
>
> Do you really want to do all the _possible_ cpu's, not just the
> online ones? I'd rather expose less (think MAXCPU) and then have the
> CPU hotplug code expose the page as the CPU comes up?

We already have exactly the same issue for cpu_entry_area. If we change it, I think we should do cpu_entry_area at the same time. But that's awkward because cpu_entry_area is mapped one PMD at a time right now.

It's also awkward to expose a percpu page dynamically, because (I think) percpu data isn't guaranteed to all be in the same PGD-sized area. A vmalloc fault in the early SYSCALL64 path is fatal.
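
To make the PGD concern concrete, here is a minimal userspace sketch, assuming x86-64 4-level paging (one top-level entry covers 2^39 bytes, 512 entries per table). The names sketch_pgd_index() and sketch_same_pgd() are made up for illustration and are not kernel functions. Two percpu addresses that land in different top-level slots would need different PGD entries populated in the user page tables:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4-level paging, so one PGD entry spans 1 << 39 bytes. */
#define SKETCH_PGDIR_SHIFT   39
#define SKETCH_PTRS_PER_PGD  512

/* Hypothetical helper: which top-level slot does this address use? */
static inline unsigned int sketch_pgd_index(uint64_t va)
{
	return (unsigned int)((va >> SKETCH_PGDIR_SHIFT) &
			      (SKETCH_PTRS_PER_PGD - 1));
}

/* Hypothetical helper: do two addresses share one PGD entry? */
static inline int sketch_same_pgd(uint64_t a, uint64_t b)
{
	int same = sketch_pgd_index(a) == sketch_pgd_index(b);

	/* Sanity check: an address always shares a slot with itself. */
	assert(a != b || same);
	return same;
}
```

If sketch_same_pgd() returned 0 for the percpu areas of two CPUs, mapping the second one at hotplug time would mean touching a top-level entry that existing user page tables may not have, which is the situation that is hard to handle on the early SYSCALL64 path.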

>
>> + unsigned long va = (unsigned long)&per_cpu(cpu_tss_rw, cpu);
>> + phys_addr_t pa = per_cpu_ptr_to_phys((void *)va);
>> + pte_t *target_pte;
>> +
>> + target_pte = pti_user_pagetable_walk_pte(va);
>
> This function only exists if CONFIG_X86_VSYSCALL_EMULATION, so it
> won't even compile under (very unusual) configurations.

Oops.

>
> The "disgusting" part is that I think it could/should share more code
> with the vsyscall case, and the whole target-pte checking and setting
> should be shared too.

I tried that. It was uglier. The percpu code wants to make up a new PTE because the real kernel mapping uses large pages. The vsyscall code wants to copy a PTE because it's really a PTE and it has unusual permissions.
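
The difference between the two cases can be sketched in userspace C. This is a simplified model, not the patch: the patch obtains the physical address with per_cpu_ptr_to_phys(), whereas this sketch derives it from a hypothetical 2M large-page entry to show why there is no 4K PTE to copy. All sketch_* names are assumptions for illustration; only the bit positions (present = bit 0, writable = bit 1, physical address in bits 51:12) follow the x86-64 page table format.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT  12
#define SKETCH_PMD_SHIFT   21
#define SKETCH_PRESENT     (1ULL << 0)
#define SKETCH_RW          (1ULL << 1)
#define SKETCH_PSE         (1ULL << 7)            /* entry maps a large page */
#define SKETCH_PHYS_MASK   0x000ffffffffff000ULL  /* bits 51:12 */

typedef uint64_t sketch_pte_t;

/*
 * percpu case: the kernel maps va with a 2M entry, so a 4K PTE has to be
 * made up from the large page's base plus the 4K-aligned offset of va.
 */
static inline sketch_pte_t sketch_make_pte(uint64_t large_entry, uint64_t va,
					   uint64_t prot)
{
	uint64_t large_mask = (1ULL << SKETCH_PMD_SHIFT) - 1;
	uint64_t page_mask = (1ULL << SKETCH_PAGE_SHIFT) - 1;
	uint64_t base = large_entry & SKETCH_PHYS_MASK & ~large_mask;
	uint64_t offset = va & large_mask & ~page_mask;

	return (base + offset) | prot;
}

/*
 * vsyscall case: a real 4K PTE already exists, and its unusual permission
 * bits must be preserved, so the entry is copied verbatim.
 */
static inline sketch_pte_t sketch_copy_pte(sketch_pte_t src)
{
	return src;
}
```

The first helper chooses its own protection bits; the second must not touch them. That is roughly why a shared helper ends up needing a mode switch and reads worse than two short call sites.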

>
> Because of it not being shared, I react to this:
>
>> + set_pte(target_pte, pfn_pte(pa >> PAGE_SHIFT, PAGE_KERNEL));
>
> Hmm. The vsyscall code just does
>
> *target_pte = ..
>
> without any set_pte() stuff. Do we want/need the PVOP cases, and if
> so, why doesn't the vsyscall case need it?

It doesn't need it. I could use plain assignment.

>
> Anyway, I love the approach, and how this gets rid of the nasty
> trampoline, so no real complaints, just "this needs some fixups".
>
>

I'll do the fixups. I think that, if we want to unmap the pages for CPUs that aren't present, that should be a separate patch. I'm also not convinced it adds much value.

In general, PTI is fairly crappy, and it leaks all kinds of information. I suspect the worst leak is the NMI stack for local and remote CPUs. Fixing *that* is going to be fugly, but may actually be important, because I can easily imagine malicious user code that causes arbitrary kernel memory to get read and spilled on the NMI stack.

What we *should* do IMO is defer allocation of percpu space for not-present CPUs to save a bunch of memory. But that's a major change and will probably break things.