Re: [PATCH v2 3/7] x86/sev: add support for RMPOPT instruction
From: Sean Christopherson
Date: Wed Mar 04 2026 - 10:19:54 EST
On Mon, Mar 02, 2026, Ashish Kalra wrote:
> @@ -500,6 +508,61 @@ static bool __init setup_rmptable(void)
> +/*
> + * 'val' is a system physical address aligned to 1GB OR'ed with
> + * a function selection. Currently supported functions are 0
> + * (verify and report status) and 1 (report status).
> + */
> +static void rmpopt(void *val)
> +{
> + asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
> + : : "a" ((u64)val & PUD_MASK), "c" ((u64)val & 0x1)
> + : "memory", "cc");
> +}
> +
> +static int rmpopt_kthread(void *__unused)
> +{
> + phys_addr_t pa_start, pa_end;
> +
> + pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), PUD_SIZE);
> + pa_end = ALIGN(PFN_PHYS(max_pfn), PUD_SIZE);
> +
> + /* Limit memory scanning to the first 2 TB of RAM */
> + pa_end = (pa_end - pa_start) <= SZ_2T ? pa_end : pa_start + SZ_2T;
> +
> + while (!kthread_should_stop()) {
> + phys_addr_t pa;
> +
> + pr_info("RMP optimizations enabled on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
> + pa_start, pa_end);
> +
> + /*
> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this range of
> + * memory does not contain any SNP guest memory.
> + */
> + for (pa = pa_start; pa < pa_end; pa += PUD_SIZE) {
> + /* Bit zero passes the function to the RMPOPT instruction. */
> + on_each_cpu_mask(cpu_online_mask, rmpopt,
> + (void *)(pa | RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS),
> + true);
> +
> + /* Give a chance for other threads to run */

I'm not terribly concerned with other threads, but I am most definitely concerned
about other CPUs. IIUC, *every* time a guest_memfd file is destroyed, the kernel
will process *every* 1GiB chunk of memory, interrupting *every* CPU in the process.
Given that the whole point of RMPOPT is to allow running non-SNP and SNP VMs
side-by-side, inducing potentially significant jitter when stopping SNP VMs seems
like a dealbreaker.

Even using a kthread seems flawed, e.g. if all CPUs in the system are being used
to run VMs, then the kernel could be stealing cycles from an arbitrary VM/vCPU to
process RMPOPT. Contrast that with KVM's NX hugepage recovery thread, which is
spawned in the context of a specific VM so that recovering steady state performance
at the cost of periodically consuming CPU cycles is bound entirely to that VM.

I don't see any performance data in either posted version. Bluntly, this series
isn't going anywhere without data to guide us. E.g. comments like this from v1
: And there is a cost associated with re-enabling the optimizations for all
: system RAM (even though it runs as a background kernel thread executing RMPOPT
: on different 1GB regions in parallel and with inline cond_resched()'s),
: we don't want to run this periodically.
suggest there is meaningful cost associated with the scan.