Re: [PATCH v2] x86/entry: Avoid redundant CR3 write on paranoid returns

From: Dave Hansen
Date: Fri Oct 27 2023 - 16:12:29 EST


On 9/20/23 08:04, Brendan Jackman wrote:
> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
>
> This path gets used called from:
>
> 1. #NMI return.
> 2. paranoid_exit (i.e. #MCE, #VC, #DB and #DF return)
>
> Contrary to the implication in commit 21e94459110252 ("x86/mm: Optimize
> RESTORE_CR3"), we never modify CR3 in any of these exceptions, except
> for switching from user to kernel pagetables under PTI. That means that
> most of the time when returning from an exception that interrupted the
> kernel no CR3 restore is necessary. Writing CR3 is expensive on some
> machines, so this commit avoids redundant writes.

Please avoid "we's" in changelogs.

One thing that I think is key to this patch, but which is muddled up in
that changelog:

RESTORE_CR3 is *ONLY* used when returning to the kernel.
It must handle user CR3 values because the kernel can run in the
entry/exit code with user CR3 values. That means that restoring
user CR3 values is important functionally but is actually rare
in practice.

That's _not_ obvious from just glancing at RESTORE_CR3 call sites or
reading your changelog. It also makes it obvious why this patch is
worth it: it optimizes the overwhelmingly common case.

> I said "most of the time" because we might have interrupted the kernel
> entry before the user->kernel CR3 switch or the exit after the
> kernel->user switch. In the former case skipping the restore might
> actually be be fine, but definitely not the latter. So we do still need
> to check the saved CR3 and restore it if it's a user CR3.
>
> To reflect the new behaviour RESTORE_CR3 is given a longer name, and a
> comment that was describing its behaviour at the call site is removed.
> We can also simplify the code around the SET_NOFLUSH_BIT invocation
> as we no longer need to branch to it from above.
Also, I don't feel _that_ strongly about the naming, but I'd kinda
rather it be:

PARANOID_RESTORE_CR3

That would at least label it explicitly for these paranoid exit cases.
Sure, this _can_ get used in other contexts theoretically, but it's
really only suitable for the two paranoid exit paths.

It also avoids confusion about whether the "USER" refers to the context
or the CR3 value.

I would probably also give this CR3 function a declared purpose:

/* Restore CR3 from a kernel context. May restore a user CR3 value. */