Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

From: Alexandre Chartre
Date: Wed Nov 18 2020 - 12:19:53 EST



On 11/18/20 2:22 PM, David Laight wrote:
From: Alexandre Chartre
Sent: 18 November 2020 10:30
...
Correct, this RFC is not changing the overhead. However, it is a step toward
being able to execute some selected syscalls or interrupt handlers without
switching to the kernel page-table. The next step would be to identify and add
the necessary mappings to the user page-table so that specified syscalls can be
executed without switching the page-table.

Remember that without PTI user space can read all kernel memory.
(I'm not 100% sure you can force a cache-line read.)
It isn't even that slow.
(Even I can understand how it works.)

So if you are worried about user space doing that you can't really
run anything on the user page tables.

Yes, without PTI, userspace can read all kernel memory. But to run some
part of the kernel you don't need to have all kernel mappings. Also, a lot
of the kernel contains non-sensitive information which can be safely exposed
to userspace. So there's probably some room for running carefully selected
syscalls with the user page-table (and hopefully useful ones).

System calls like getpid() are irrelevant - they aren't used (much).
Even the time of day ones are implemented in the VDSO without a
context switch.

getpid()/getppid() is interesting because it gives a measure of the overhead
PTI is adding. But the impact can be larger if some TLB flushes are
also required (as you mentioned below).
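
As a rough illustration of the kind of measurement meant here, the sketch
below times a raw getpid system call in a loop. It is only a sketch, not the
benchmark used for this RFC: the helper name getpid_ns_per_call() and the
iteration count are made up for the example, and it assumes a Linux system
where syscall(SYS_getpid) really enters the kernel on every call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch series): returns the average
 * wall-clock cost, in nanoseconds, of one raw getpid system call.
 */
static double getpid_ns_per_call(long iterations)
{
    struct timespec start, end;
    double elapsed_ns;
    long i;

    assert(iterations > 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++)
        syscall(SYS_getpid);    /* raw syscall, so libc cannot cache the pid */
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
               + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling getpid_ns_per_call(1000000) once on a kernel booted with pti=on and
once with pti=off, and taking the difference, gives an estimate of the
per-syscall cost of the page-table switch on that machine. The number
includes the whole entry/exit path, so it is an upper bound on the CR3
switch itself.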


So the overheads come from other system calls that 'do work'
without actually sleeping.
I'm guessing things like read, write, sendmsg, recvmsg.

The only interesting system call I can think of is futex.
As well as all the calls that return immediately because the
mutex has been released while entering the kernel, I suspect
that being pre-empted by a different thread (of the same process)
doesn't actually need CR3 reloading (without PTI).

I also suspect that it isn't just the CR3 reload that costs.
There could (depending on the cpu) be associated TLB and/or cache
invalidations that have a much larger effect on programs with
large working sets than on simple benchmark programs.

Right. The TLB flush is mitigated with PCID, but this has
more impact if there's no PCID.
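
To know which of the two cases a given test machine falls into, one can look
for the "pcid" flag in /proc/cpuinfo. The sketch below does that from C. The
function names line_has_flag() and cpu_has_pcid() are made up for the
example, and it assumes the x86 Linux layout where the CPU features are
listed on a line starting with "flags". Note that it only tells whether the
CPU advertises PCID, not whether the running kernel is using it.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if 'flag' appears as a whole word in
 * 'line', 0 otherwise. Whole-word matching matters because "invpcid"
 * contains "pcid" as a substring.
 */
static int line_has_flag(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = line;

    if (len == 0)
        return 0;

    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        char after = p[len];
        int ends = after == '\0' || after == ' ' || after == '\t' || after == '\n';

        if (starts && ends)
            return 1;
        p += len;
    }
    return 0;
}

/*
 * Hypothetical helper: returns 1 if the first "flags" line of
 * /proc/cpuinfo lists pcid, 0 if it does not, -1 if the file
 * cannot be opened.
 */
static int cpu_has_pcid(void)
{
    char line[8192];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "flags", 5) == 0) {
            found = line_has_flag(line, "pcid");
            break;
        }
    }
    fclose(f);
    return found;
}
```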


Now bits of data that you are 'more worried about' could be kept
in physical memory that isn't normally mapped (or referenced by
a TLB) and only mapped when needed.
But that doesn't help the general case.


Note that having syscalls which can be done without switching the
page-table is just one benefit you can get from this RFC. The main
benefit is for integrating Address Space Isolation (ASI), which will be
much more complex if ASI has to plug into the current assembly CR3 switch.

Thanks,

alex.