Re: [PATCH v4 4/7] x86/sev: Add support to perform RMP optimizations asynchronously

From: Kalra, Ashish

Date: Tue May 05 2026 - 16:33:12 EST


Hello Ackerley,

On 5/1/2026 1:57 PM, Ackerley Tng wrote:
> Ashish Kalra <Ashish.Kalra@xxxxxxx> writes:
>
>>
>> [...snip...]
>>
>> +/*
>> + * 'val' is a system physical address.
>> + */
>> +static void rmpopt_smp(void *val)
>> +{
>> + u64 rax = ALIGN_DOWN((u64)val, SZ_1G);
>> + u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
>> +
>> + __rmpopt(rax, rcx);
>> +}
>> +
>> +static void rmpopt(u64 pa)
>> +{
>> + u64 rax = ALIGN_DOWN(pa, SZ_1G);
>> + u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
>> +
>> + __rmpopt(rax, rcx);
>> +}
>> +
>
> Could rmpopt_smp() call rmpopt() to remove duplicate code?

Yes.

>
>> +/*
>> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this
>> + * range of memory does not contain any SNP guest memory.
>> + */
>> +static void rmpopt_work_handler(struct work_struct *work)
>> +{
>> + bool current_cpu_cleared = false;
>> + phys_addr_t pa;
>> +
>> + pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
>> + rmpopt_pa_start, rmpopt_pa_end);
>> +
>> + /*
>> + * RMPOPT scans the RMP table, stores the result of the scan in the
>> + * reserved processor memory. The RMP scan is the most expensive
>> + * part. If a second RMPOPT occurs, it can skip the expensive scan
>> + * if they can see a cached result in the reserved processor memory.
>> + *
>> + * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>> + * on every other primary thread. This potentially allows the
>> + * followers to use the "cached" scan results to avoid repeating
>> + * full scans.
>
> Out of curiosity, how does this caching work? Is it possible to do it
> once and then synchronize the cache to the other CPUs?

The first CPU does the full RMP scan and stores the result of the scan in reserved processor memory.
And other CPUs can skip the scan if they can see a cached result in the reserved processor memory.
So i believe the other CPUs would *still* have to issue the RMPOPT instruction, but then they will
avoid the full RMP scan and use the cached results.

>
>> + */
>> +
>> + if (cpumask_test_cpu(smp_processor_id(), &rmpopt_cpumask)) {
>> + cpumask_clear_cpu(smp_processor_id(), &rmpopt_cpumask);
>> + current_cpu_cleared = true;
>> + }
>> +
>> + /* current CPU */
>> + for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>> + rmpopt(pa);
>> +
>> + for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> + on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
>> + (void *)pa, true);
>> +
>> + /* Give a chance for other threads to run */
>> + cond_resched();
>> +
>> + }
>> +
>> + if (current_cpu_cleared)
>> + cpumask_set_cpu(smp_processor_id(), &rmpopt_cpumask);
>
> Sashiko [1] pointed this out: after cond_resched(), this code might be
> on a different cpu so smp_processor_id() would return a different cpu,
> that would mess up the global cpumask.
>
> Perhaps it's better to store the id on a stack? Or actually, what if we
> give on_each_cpu_mask a copy of rmpopt_cpumask with the current cpu
> unset?
>
> [1] https://sashiko.dev/#/patchset/cover.1775874970.git.ashish.kalra%40amd.com
>

Yes, i think it makes sense to store the id on the stack.

Additionally, i will be moving the computing of the cpumask within this workitem,
so cpumask won't be global.

>> +}
>> +
>> void snp_setup_rmpopt(void)
>> {
>> u64 rmpopt_base;
>> @@ -568,9 +656,20 @@ void snp_setup_rmpopt(void)
>> if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
>> return;
>>
>> + /*
>> + * Create an RMPOPT-specific workqueue to avoid scheduling
>> + * RMPOPT workitem on the global system workqueue.
>> + */
>> + rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
>> + if (!rmpopt_wq) {
>> + setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
>> + return;
>> + }
>> +
>> /*
>> * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
>> - * setup RMPOPT_BASE MSR.
>> + * setup RMPOPT_BASE MSR. Additionally only one thread per core needs
>> + * to issue the RMPOPT instruction.
>> */
>>
>> for_each_online_cpu(cpu) {
>> @@ -590,6 +689,20 @@ void snp_setup_rmpopt(void)
>> * up to 2 TB of system RAM on all CPUs.
>> */
>> wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
>> +
>> + INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
>> +
>> + rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
>> +
>> + /* Limit memory scanning to the first 2 TB of RAM */
>
> I think this is better phrased as "limit memory scanning to 2TB",
>

Ok.

>> + if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T)
>> + rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
>
> and then this could be
>
> rmpopt_pa_end = min(rmpopt_pa_end, rmpopt_pa_start + SZ_2T);
>

Yes.

Thanks,
Ashish

>> +
>> + /*
>> + * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
>> + * optimizations on all physical memory.
>> + */
>> + queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
>> }
>> EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
>>
>
> Reviewed-by: Ackerley Tng <ackerleytng@xxxxxxxxxx>