Re: [PATCH v2 3/7] x86/sev: add support for RMPOPT instruction

From: Kalra, Ashish

Date: Wed Mar 04 2026 - 20:43:51 EST


Hello Dave and Sean,

On 3/4/2026 9:25 AM, Dave Hansen wrote:
> On 3/4/26 07:01, Sean Christopherson wrote:
>> I don't see any performance data in either posted version. Bluntly, this series
>> isn't going anywhere without data to guide us. E.g. comments like this from v1
>>
>> : And there is a cost associated with re-enabling the optimizations for all
>> : system RAM (even though it runs as a background kernel thread executing RMPOPT
>> : on different 1GB regions in parallel and with inline cond_resched()'s),
>> : we don't want to run this periodically.
>>
>> suggest there is meaningful cost associated with the scan.
>
> Well the RMP is 0.4% of the size of system memory, and I assume that you
> need to scan the whole table. There are surely shortcuts for 2M pages,
> but with 4k, that's ~8.5GB of RMP table for 2TB of memory. That's an
> awful lot of memory traffic for each CPU.

The RMPOPT instruction is optimized for 2M pages: for the 1GB region
containing the specified address, it checks that none of the 512 2MB
RMP entries in that region is assigned.

>
> It'll be annoying to keep a refcount per 1GB of paddr space.
>
> One other way to do it would be to loosely mirror the RMPOPT bitmap and
> keep our own bitmap of 1GB regions that _need_ RMPOPT run on them. Any
> private=>shared conversion sets a bit in the bitmap and schedules some
> work out in the future.
>
> It could also be less granular than that. Instead of any private=>shared
> conversion, the RMPOPT scan could be triggered on VM destruction which
> is much more likely to result in RMPOPT doing anything useful.

Yes, it will need to be less granular than scheduling RMPOPT work for
every private->shared conversion.

And that is what we are doing in the v2 patch series: the RMPOPT scan is
triggered on VM destruction.

>
> BTW, I assume that the RMPOPT disable machinery is driven from the
> INVLPGB-like TLB invalidations that are a part of the SNP
> shared=>private conversions. It's a darn shame that RMPOPT wasn't
> broadcast in the same way. It would save the poor OS a lot of work. The
> RMPOPT table is per-cpu of course, but I'm not sure what keeps *a* CPU
> from broadcasting its success finding an SNP-free physical region to
> other CPUs.

The hardware does this broadcast for the RMPUPDATE instruction:
RMPUPDATE broadcasts to all CPUs, clearing the matching entries in their
RMPOPT tables.

For the RMPOPT instruction itself there is no such broadcast, but RMPOPT
only needs to be executed on one thread per core; executing it also
programs the per-CPU RMPOPT table of the sibling thread.

That is why the v1 patch series included an optimization to execute
RMPOPT only on the primary thread, and I believe we should include that
optimization in a future series.

>
> tl;dr: I agree with you. The cost of these scans is going to be
> annoying, and it's going to need OS help to optimize it.

Here is some performance data:

Raw CPU cycles for a single RMPOPT instruction, func=0:

RMPOPT during snp_rmptable_init() while booting:

....
[ 12.098580] SEV-SNP: RMPOPT max. CPU cycles 501460
[ 12.103839] SEV-SNP: RMPOPT min. CPU cycles 60
[ 12.108799] SEV-SNP: RMPOPT average cycles 139790


RMPOPT during SNP_INIT_EX, at CCP module load at boot:

[ 40.206619] SEV-SNP: RMPOPT max. CPU cycles 248083620
[ 40.206629] SEV-SNP: RMPOPT min. CPU cycles 60
[ 40.206629] SEV-SNP: RMPOPT average cycles 249820

RMPOPT after SNP guest shutdown:
...
[ 298.746893] SEV-SNP: RMPOPT max. CPU cycles 248083620
[ 298.746898] SEV-SNP: RMPOPT min. CPU cycles 60
[ 298.746900] SEV-SNP: RMPOPT average cycles 127859


I believe the min. CPU cycles correspond to the case where RMPOPT fails early.


Raw CPU cycles for one complete iteration of executing RMPOPT (func=0) on all CPUs across all of system RAM:

This measures the complete loop below, with the cond_resched() calls removed:

while (!kthread_should_stop()) {
	phys_addr_t pa;

	pr_info("RMP optimizations enabled on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
		pa_start, pa_end);

	start = rdtsc_ordered();
	/*
	 * RMPOPT optimizations skip RMP checks at 1GB granularity if this
	 * range of memory does not contain any SNP guest memory.
	 */
	for (pa = pa_start; pa < pa_end; pa += PUD_SIZE) {
		/* Bit zero passes the function to the RMPOPT instruction. */
		on_each_cpu_mask(cpu_online_mask, rmpopt,
				 (void *)(pa | RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS),
				 true);
	}
	end = rdtsc_ordered();

	pr_info("RMPOPT cycles taken for physical address range 0x%016llx - 0x%016llx on all cpus %llu cycles\n",
		pa_start, pa_end, end - start);

	set_current_state(TASK_INTERRUPTIBLE);
	schedule();
}


RMPOPT during snp_rmptable_init() while booting:

...
[ 12.114047] SEV-SNP: RMPOPT cycles taken for physical address range 0x0000000000000000 - 0x0000010380000000 on all cpus 1499496600 cycles

RMPOPT during SNP_INIT_EX:
...
[ 40.206630] SEV-SNP: RMPOPT cycles taken for physical address range 0x0000000000000000 - 0x0000010380000000 on all cpus 686519180 cycles

RMPOPT after SNP guest shutdown:
...
[ 298.746900] SEV-SNP: RMPOPT cycles taken for physical address range 0x0000000000000000 - 0x0000010380000000 on all cpus 369059160 cycles

Thanks,
Ashish