When entering the kernel from userland, use the per-task PTI stack
instead of the per-cpu trampoline stack. Like the trampoline stack,
the PTI stack is mapped both in the kernel and in the user page-table.
Using a per-task stack which is mapped into the kernel and the user
page-table instead of a per-cpu stack will allow executing more code
before switching to the kernel stack and to the kernel page-table.
Why?
When executing more code in the kernel, we are likely to reach a point
where we need to sleep while we are using the user page-table, so we need
to be using a per-thread stack.
I can't immediately evaluate how nasty the page table setup is because
it's not in this patch.
The page-table is the regular page-table as introduced by PTI. It is just
augmented with a few additional mapping which are in patch 11 (x86/pti:
Extend PTI user mappings).
But AFAICS the only thing that this enables is sleeping with user pagetables.
That's precisely the point, it allows to sleep with the user page-table.
Do we really need to do that?
Actually, probably not with this particular patchset, because I do the page-table
switch at the very beginning and end of the C handler. I had some code where I
moved the page-table switch deeper in the kernel handler where you definitively
can sleep (for example, if you switch back to the user page-table before
exit_to_user_mode_prepare()).
So a first step should probably be to not introduce the per-task PTI trampoline stack,
and stick with the existing trampoline stack. The per-task PTI trampoline stack can
be introduced later when the page-table switch is moved deeper in the C handler and
we can effectively sleep while using the user page-table.
Seems reasonable.
I finally remember why I have introduced a per-task PTI trampoline stack right now:
that's to be able to move the CR3 switch anywhere in the C handler. To do so, we need
a per-task stack to enter (and return) from the C handler as the handler can potentially
go to sleep.
Without a per-task trampoline stack, we would be limited to call the switch CR3 functions
from the assembly entry code before and after calling the C function handler (also called
from assembly).
The noinstr part of the C entry code won't sleep.
But the noinstr part of the handler can sleep, and if it does we will need to
preserve the trampoline stack (even if we switch to the per-task kernel stack to
execute the noinstr part).
Example:
#define DEFINE_IDTENTRY(func) \
static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
irqentry_state_t state; -+ \
| \
user_pagetable_escape(regs); | use trampoline stack (1)
state = irqentry_enter(regs); | \
instrumentation_begin(); -+ \
run_idt(__##func, regs); |===| run __func() on kernel stack (this can sleep)
instrumentation_end(); -+ \
irqentry_exit(regs, state); | use trampoline stack (2)
user_pagetable_return(regs); -+ \
}
Between (1) and (2) we need to preserve and use the same trampoline stack
in case __func() went sleeping.
Why? Right now, we have the percpu entry stack, and we do just fine
if we enter on one percpu stack and exit from a different one.
We would need to call from asm to C on the entry stack, return back to
asm, and then switch stacks.
That's the problem: I didn't want to return back to asm, so that the pagetable
switch can be done anywhere in the C handler.
So yes, returning to asm to switch the stack is the solution if we want to avoid
having per-task trampoline stack. The drawback is that this forces to do the
page-table switch at the beginning and end of the handler; the pagetable switch
cannot be moved deeper down into the C handler.
But that's probably a good first step (effectively just moving CR3 switch to C
without adding per-task trampoline stack). I will update the patches to do that,
and we can defer the per-task trampoline stack to later if there's an effective
need for it.