Re: [PATCH v3 3/6] x86/sev: Add support to perform RMP optimizations asynchronously
From: Kalra, Ashish
Date: Mon Mar 30 2026 - 20:47:13 EST
On 3/30/2026 6:22 PM, Dave Hansen wrote:
> On 3/30/26 15:26, Ashish Kalra wrote:
> ...
>> As SEV-SNP is enabled by default on boot when an RMP table is
>> allocated by BIOS, the hypervisor and non-SNP guests are subject to
>> RMP write checks to provide integrity of SNP guest memory.
>
> This is a long-winded way of saying:
>
> When SEV-SNP is enabled, all writes to memory are checked to
> ensure integrity of SNP guest memory. This imposes performance
> overhead on the whole system.
>
>> RMPOPT is a new instruction that minimizes the performance overhead of
>> RMP checks on the hypervisor and on non-SNP guests by allowing RMP
>> checks to be skipped for 1GB regions of memory that are known not to
>> contain any SEV-SNP guest memory.
>>
>> Add support for performing RMP optimizations asynchronously using a
>> dedicated workqueue, scheduling delayed work to perform RMP
>> optimizations every 10 seconds.
>
> Gah, does it really do this _every_ 10 seconds? Whether or not any
> guests are running or if the SEV-SNP state has changed at *all*? This
> code doesn't implement that, right? If so, why mention it here?
>
>> +static void rmpopt_work_handler(struct work_struct *work)
>> +{
>> + phys_addr_t pa;
>> +
>> + pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
>> + rmpopt_pa_start, rmpopt_pa_end);
>> +
>> + /*
>> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this
>> + * range of memory does not contain any SNP guest memory. Optimize
>> + * each range on one CPU first, then let other CPUs execute RMPOPT
>> + * in parallel so they can skip most work as the range has already
>> + * been optimized.
>> + */
>
> This comment could be much more clear.
>
> First, the granularity has *zero* to do with this optimization.
>
> Second, the optimization this code is doing only makes sense if the RMP
> itself is caching the RMPOPT result in a global, single place. That's
> not explained. It needs something like:
>
> RMPOPT does three things: It scans the RMP table, stores the
> result of the scan in the global RMP table and copies that
> result to a per-CPU table. The scan is the most expensive part.
> If a second RMPOPT occurs, it can skip the expensive scan if it
> sees the "cached" scan result in the RMP.
>
> Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
> on every other primary thread. This potentially allows the
> followers to use the "cached" scan results to avoid repeating
> full scans.
>
>> + cpumask_clear_cpu(smp_processor_id(), &primary_threads_cpumask);
>
> How do you know that the current CPU is *in* 'primary_threads_cpumask'
> in the first place? I guess it doesn't hurt to do RMPOPT in two places,
> but why not just be careful about it?
>
> Also, logically, 'primary_threads_cpumask' never changes (modulo CPU
> hotplug). The thing you're tracking here is "primary CPUs that need to
> have RMPOPT executed on them". That's a far different thing than the
> name for the variable.
Ok, I will name it more appropriately.
>
>> + /* current CPU */
>> + for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>> + rmpopt((void *)pa);
>
> This _looks_ rather wonky because it's casting a 'pa' to a virtual
> address for no apparent reason.
This is re-using the rmpopt() wrapper, which exists for use with the
on_each_cpu_mask() call below, and that requires the 'void *' parameter.
The on_each_cpu_mask() call is needed to execute the RMPOPT instructions
in parallel, as that is part of the bare minimum set of optimizations
needed for the RMPOPT loop.
I will add another wrapper for calling rmpopt() with the 'pa'
parameter directly.
>
> Also, rmpopt() itself does 1G alignment. This code ^ also aligns the
> start and end. Why?
Yes, the RMPOPT instruction does 1G alignment on the specified PA; the
alignment in this code is to ensure that the PFNs - min_low_pfn and
max_pfn - are aligned appropriately.
>
>> + for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> + on_each_cpu_mask(&primary_threads_cpumask, rmpopt,
>> + (void *)pa, true);
>> +
>> + /* Give a chance for other threads to run */
>> + cond_resched();
>> +
>> + }
>> +
>> + cpumask_set_cpu(smp_processor_id(), &primary_threads_cpumask);
>> +}
>
> Honestly, I _really_ wish this series would dispense with *all* the
> optimizations in the first version. This looks really wonky because
> 'primary_threads_cpumask' is a global variable and is initialized before
> the work function when it could probably be done within the work function.
>
It is static and will never change, so I was initializing it here outside the
work function, but I can move it inside the work function.
> It's also *really* generically and non-descriptively named for a
> global-scope variable.
>
I can name it appropriately.
I definitely want to have the bare minimum set of optimizations for the RMPOPT loop
in place for the first patch series, which is:
1). Issuing RMPOPT on only one CPU first, before doing it in parallel on
all other CPUs.
2). Issuing RMPOPT on only one thread per core.
This is the bare minimum set of optimizations for the RMPOPT loop, for which
I had computed the performance numbers before:
Best runtime for the RMPOPT loop:
When RMPOPT exits early because it finds an assigned page on the first RMP entry it checks in the 1GB range -> ~95ms.
Worst runtime for the RMPOPT loop:
When RMPOPT does not find any assigned page in the full 1GB range it is checking -> ~311ms.
The following series will add support for system RAM > 2TB and other optimizations, but
this bare minimum set of optimizations for RMPOPT should definitely be part of the initial
patch series.
>> +static void rmpopt_all_physmem(bool early)
>> +{
>> + if (!rmpopt_wq)
>> + return;
>> +
>> + if (early)
>> + queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work,
>> + msecs_to_jiffies(1));
>> + else
>> + queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work,
>> + msecs_to_jiffies(RMPOPT_WORK_TIMEOUT));
>> +}
>
> This is rather unfortunate on several levels.
>
> First, even if the 'bool early' thing was a good idea, this should be
> written:
>
> unsigned long timeout = RMPOPT_WORK_TIMEOUT;
>
> if (early)
> timeout = 1;
>
> queue_delayed_work(rmpopt_wq,
> &rmpopt_delayed_work,
> msecs_to_jiffies(timeout));
>
> But, really, why does it even *need* a bool for early/late? Just do a
> late_initcall() if you want this done near boot time.
>
Ok.
>
>> static __init void configure_and_enable_rmpopt(void)
>> {
>> phys_addr_t pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
>> @@ -499,6 +582,37 @@ static __init void configure_and_enable_rmpopt(void)
>> */
>> for_each_online_cpu(cpu)
>> wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
>
> What is the scope of MSR_AMD64_RMPOPT_BASE? Can you have it enabled on
> one thread and not the other? Could they be different values both for
> enabling and the rmpopt_base value?
>
> If it's not per-thread, then why is it being initialized for each thread?
>
Only one logical thread per core needs to set the RMPOPT_BASE MSR as it is per-core,
so I will use the "primary_threads_cpumask" here for programming this MSR.
That is just another reason to set up the "primary_threads_cpumask" in this function
and then re-use it for the RMPOPT worker.
>> + /*
>> + * Create an RMPOPT-specific workqueue to avoid scheduling
>> + * RMPOPT workitem on the global system workqueue.
>> + */
>> + rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
>> + if (!rmpopt_wq)
>> + return;
>
> I'd probably just put this first. Then if the allocation fails, you
> don't even bother doing the WRMSRs. Heck if you did that, you could even
> use the MSR bit for the indicator of if RMPOPT is supported.
Ok.
>
>> + INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
>> +
>> + rmpopt_pa_start = pa_start;
>
> Why is there a 'rmpopt_pa_start' and 'pa_start'?
Again, I was just statically setting 'rmpopt_pa_start' here for the RMPOPT worker,
but I can drop the local and use 'rmpopt_pa_start' directly.
Thanks,
Ashish
>
>> + rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
>> +
>> + /* Limit memory scanning to the first 2 TB of RAM */
>> + if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T)
>> + rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
>> +
>> + /* Only one thread per core needs to issue RMPOPT instruction */
>> + for_each_online_cpu(cpu) {
>> + if (!topology_is_primary_thread(cpu))
>> + continue;
>> +
>> + cpumask_set_cpu(cpu, &primary_threads_cpumask);
>> + }
>> +
>> + /*
>> + * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
>> + * optimizations on all physical memory.
>> + */
>> + rmpopt_all_physmem(true);
>> }