Re: [PATCH v2 3/7] x86/sev: add support for RMPOPT instruction
From: Dave Hansen
Date: Wed Mar 11 2026 - 18:20:58 EST
On 3/11/26 14:24, Kalra, Ashish wrote:
...
> There are 2 active SNP VMs here; one SNP VM is being terminated while the other is still running. Both VMs are configured with 100GB of guest RAM:
>
> When this loop is executed when the SNP guest terminates:
>
> [ 232.789187] SEV-SNP: RMPOPT execution time 391609638 ns for physical address range 0x0000000000000000 - 0x0000020000000000 on all cpus -> ~391 ms
>
> [ 234.647462] SEV-SNP: RMPOPT execution time 457933019 ns for physical address range 0x0000000000000000 - 0x0000020000000000 on all cpus -> ~457 ms
That's better, but it's not quite what I am looking for.
The most important case (IMNHO) is when RMPOPT falls flat on its face:
it tries to optimize the full 2TB of memory and manages to optimize nothing.
I doubt that two 100GB VMs will get close to that case. It's
theoretically possible, but unlikely.
You also didn't mention 4k vs. 2M vs. 1G mappings.
> Now, there are a couple of additional RMPOPT optimizations which can be applied to this loop :
>
> 1). RMPOPT can skip the bulk of its work if another CPU has already optimized that region.
> The optimal thing may be to optimize all memory on one CPU first, and then let all the others
> run RMPOPT in parallel.
Ahh, so the RMP table itself caches the result of the RMPOPT in its 1G
metadata, then the CPUs can just copy it into their core-local
optimization table at RMPOPT time?
That's handy.
*But*, for the purposes of finding pathological behavior, it's actually
contrary to what I think I was asking for, which was having all 1G pages
filled with at least some private memory. If the system were in the state
I want to see tested, that optimization would not kick in.
> [ 363.926595] SEV-SNP: RMPOPT execution time 317016656 ns for physical address range 0x0000000000000000 - 0x0000020000000000 on all cpus -> ~317 ms
>
> [ 365.415243] SEV-SNP: RMPOPT execution time 369659769 ns for physical address range 0x0000000000000000 - 0x0000020000000000 on all cpus -> ~369 ms.
>
> So, with these two optimizations applied, there is a ~16-20% performance improvement (when an SNP guest terminates) in the execution of this loop,
> which executes RMPOPT on up to 2TB of RAM on all CPUs.
>
> Any thoughts, feedback on the performance numbers ?
16-20% isn't horrible, but it isn't really a fundamental change.
It would also be nice to see elapsed time for each CPU. Having one
pegged CPU for 400ms and 99 mostly idle ones is way different than
having 100 pegged CPUs for 400ms.
That's why I was interested in "how long it takes per-cpu".
But you could get some pretty good info with your new optimized loop:
	start = ktime_get();
	for (pa = pa_start; pa < pa_end; pa += PUD_SIZE)
		rmpopt();			// current CPU
	middle = ktime_get();
	for (pa = pa_start; pa < pa_end; pa += PUD_SIZE)
		on_each_cpu_mask(...);		// remote CPUs
	end = ktime_get();
If you do that ^ with a system:
1. full of private memory
2. empty of private memory
3. empty again
You'll hopefully see:
1. RMPOPT fall on its face. Worst case scenario (what I want to
see most)
2. RMPOPT sees great success, but has to scan the RMP at least
once. Remote CPUs get a free ride on the first CPU's scan.
Largest (middle-start) vs. (end-middle)/nr_cpus delta.
3. RMPOPT best case. Everything is already optimized.
> Ideally we should be issuing RMPOPTs to only optimize the 1G regions that contained memory associated with that guest and that should be
> significantly less than the whole 2TB RAM range.
>
> But that is something we planned for 1GB hugetlb guest_memfd support getting merged and which I believe has dependency on:
> 1). in-place conversion for guest_memfd,
> 2). 2M hugepage support for guest_memfd and finally
> 3). 1GB hugeTLB support for guest_memfd.
It's a no-brainer to do RMPOPT when you have 1GB pages around. You'll
see zero argument from me.
Doing things per-guest and for smaller pages gets a little bit harder to
reason about. In the end, this is all about trying to optimize against
the RMP table which is a global resource. It's going to get wonky if
RMPOPT is driven purely by guest-local data. There are lots of potential
pitfalls.
For now, let's just do it as simply as possible. Get maximum bang for
our buck with minimal data structures and see how that works out. It
might end up being a:
queue_delayed_work()
to do some cleanup a few seconds after each SNP guest terminates. If a
bunch of guests terminate all at once, it'll at least only do a single
set of IPIs.