Re: [RFC PATCH v8 09/10] context_tracking,x86: Defer kernel text patching IPIs when tracking CR3 switches

From: Frederic Weisbecker

Date: Wed Apr 15 2026 - 08:11:46 EST


On Tue, Mar 24, 2026 at 10:48:00AM +0100, Valentin Schneider wrote:
> text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
> them vs the newly patched instruction. CPUs that are executing in userspace
> do not need this synchronization to happen immediately, and this is
> actually harmful interference for NOHZ_FULL CPUs.
>
> As the synchronization IPIs are sent using a blocking call, returning from
> text_poke_bp_batch() implies all CPUs will observe the patched
> instruction(s), and this should be preserved even if the IPI is deferred.
> In other words, to safely defer this synchronization, any kernel
> instruction leading to the execution of the deferred instruction
> sync must *not* be mutable (patchable) at runtime.
>
> This means we must pay attention to mutable instructions in the early entry
> code:
> - alternatives
> - static keys
> - static calls
> - all sorts of probes (kprobes/ftrace/bpf/???)
>
> The early entry code is noinstr, which gets rid of the probes.
>
> Alternatives are safe, because it's boot-time patching (before SMP is
> even brought up) which is before any IPI deferral can happen.
>
> This leaves us with static keys and static calls. Any static key used in
> early entry code should be only forever-enabled at boot time, IOW
> __ro_after_init (pretty much like alternatives). Exceptions to that will
> now be caught by objtool.
>
> The deferred instruction sync is the CR3 RMW done as part of
> kPTI when switching to the kernel page table:
>
> SDM vol2 chapter 4.3 - Move to/from control registers:
> ```
> MOV CR* instructions, except for MOV CR8, are serializing instructions.
> ```
>
> Leverage the new kernel_cr3_loaded signal and the kPTI CR3 RMW to defer
> sync_core() IPIs targeting NOHZ_FULL CPUs.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Nicolas Saenz Julienne <nsaenzju@xxxxxxxxxx>
> Signed-off-by: Valentin Schneider <vschneid@xxxxxxxxxx>
> ---
> arch/x86/include/asm/text-patching.h | 5 ++++
> arch/x86/kernel/alternative.c | 34 +++++++++++++++++++++++-----
> arch/x86/kernel/kprobes/core.c | 4 ++--
> arch/x86/kernel/kprobes/opt.c | 4 ++--
> arch/x86/kernel/module.c | 2 +-
> 5 files changed, 38 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index f2d142a0a862e..628e80f8318cd 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -33,6 +33,11 @@ extern void text_poke_apply_relocation(u8 *buf, const u8 * const instr, size_t i
> */
> extern void *text_poke(void *addr, const void *opcode, size_t len);
> extern void smp_text_poke_sync_each_cpu(void);
> +#ifdef CONFIG_TRACK_CR3
> +extern void smp_text_poke_sync_each_cpu_deferrable(void);
> +#else
> +#define smp_text_poke_sync_each_cpu_deferrable smp_text_poke_sync_each_cpu
> +#endif
> extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
> extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
> #define text_poke_copy text_poke_copy
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 28518371d8bf3..f3af77d7c533c 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -6,6 +6,7 @@
> #include <linux/vmalloc.h>
> #include <linux/memory.h>
> #include <linux/execmem.h>
> +#include <linux/sched/isolation.h>
>
> #include <asm/text-patching.h>
> #include <asm/insn.h>
> @@ -13,6 +14,7 @@
> #include <asm/ibt.h>
> #include <asm/set_memory.h>
> #include <asm/nmi.h>
> +#include <asm/tlbflush.h>
>
> int __read_mostly alternatives_patched;
>
> @@ -2706,11 +2708,29 @@ static void do_sync_core(void *info)
> sync_core();
> }
>
> +static void __smp_text_poke_sync_each_cpu(smp_cond_func_t cond_func)
> +{
> + on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
> +}
> +
> void smp_text_poke_sync_each_cpu(void)
> {
> - on_each_cpu(do_sync_core, NULL, 1);
> + __smp_text_poke_sync_each_cpu(NULL);
> +}
> +
> +#ifdef CONFIG_TRACK_CR3
> +static bool do_sync_core_defer_cond(int cpu, void *info)
> +{
> + return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
> + per_cpu(kernel_cr3_loaded, cpu);

|| should be && ?

Also I would again expect full ordering here with an smp_mb() before the
check. So that:

    CPU 0                        CPU 1
    -----                        -----
    // enter_kernel              // do_sync_core_defer_cond
    kernel_cr3_loaded = 1        WRITE page table
    smp_mb()                     smp_mb()
    WRITE cr3                    READ kernel_cr3_loaded

But I'm not sure whether that ordering is enough to imply that, if CPU 1
observes kernel_cr3_loaded == 0, then CPU 0's subsequent kernel entry is
guaranteed to flush the TLB against the latest page table write.

Thoughts?

Thanks.

--
Frederic Weisbecker
SUSE Labs