Re: [PATCH v2 5/7] KVM: guest_memfd: Add cleanup interface for guest teardown

From: Ackerley Tng

Date: Wed Mar 11 2026 - 02:00:46 EST


"Kalra, Ashish" <ashish.kalra@xxxxxxx> writes:

> Hello Ackerley,
>
> On 3/9/2026 4:01 AM, Ackerley Tng wrote:
>> Ashish Kalra <Ashish.Kalra@xxxxxxx> writes:
>>
>>> From: Ashish Kalra <ashish.kalra@xxxxxxx>
>>>
>>> Introduce kvm_arch_gmem_cleanup() to perform architecture-specific
>>> cleanups when the last file descriptor for the guest_memfd inode is
>>> closed. This typically occurs during guest shutdown and termination
>>> and allows for final resource release.
>>>
>>> Signed-off-by: Ashish Kalra <ashish.kalra@xxxxxxx>
>>> ---
>>>
>>> [...snip...]
>>>
>>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>>> index 017d84a7adf3..2724dd1099f2 100644
>>> --- a/virt/kvm/guest_memfd.c
>>> +++ b/virt/kvm/guest_memfd.c
>>> @@ -955,6 +955,14 @@ static void kvm_gmem_destroy_inode(struct inode *inode)
>>>
>>> static void kvm_gmem_free_inode(struct inode *inode)
>>> {
>>> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_CLEANUP
>>> +	/*
>>> +	 * Finalize cleanup for the inode once the last guest_memfd
>>> +	 * reference is released. This usually occurs after guest
>>> +	 * termination.
>>> +	 */
>>> +	kvm_arch_gmem_cleanup();
>>> +#endif
>>
>> Folks have already talked about the performance implications of doing
>> the scan and rmpopt; I just want to call out that one VM could have
>> more than one associated guest_memfd, too.
>
> Yes, I have observed that kvm_gmem_free_inode() gets invoked multiple times
> at SNP guest shutdown.
>
> And the same is true for kvm_gmem_destroy_inode() too.
>
>>
>> I think the cleanup function should be thought of as cleanup for the
>> inode (even if it doesn't take an inode pointer since it's not (yet)
>> required).
>>
>> So, the gmem cleanup function should not handle deduplicating cleanup
>> requests, but the arch function should, if the cleanup needs
>> deduplicating.
>
> I agree, the arch function will have to handle deduplication, and for
> that it will probably need to be passed the inode pointer as a
> parameter.
>

By the time .free_folio() is called, folio->mapping may no longer exist,
so if we definitely want to deduplicate using something in the inode,
.free_folio() won't be the right callback to use.

I was thinking that deduplicating using something in the folio would be
better. Can rmpopt take a PFN range? Then there's really no
deduplication; the cleanup would be nicely narrowed to whatever was just
freed. Perhaps the PFNs could be aligned up to the nearest PMD or PUD
size for rmpopt to do the right thing.

Or perhaps some more tracking is required to check that the entire
aligned range is freed before doing the rmpopt.

I need to implement some of this tracking for guest_memfd HugeTLB
support, so if the tracking is useful for you, we should discuss!

>>
>> Also, .free_inode() is called through RCU, so it could be called after
>> some delay. Could it be possible that .free_inode() ends up being called
>> way after the associated VM gets torn down, or after KVM the module gets
>> unloaded? Does rmpopt still work fine if KVM the module got unloaded?
>
> Yes, .free_inode() can probably get called after the associated VM has
> been torn down, which should be fine for issuing RMPOPT to do
> RMP re-optimizations.
>
> As for the KVM module getting unloaded: as part of the forthcoming patch series,
> X86_SNP_SHUTDOWN would be issued during KVM module unload, which means SNP
> would get disabled and therefore RMP checks are also disabled.
>
> And as CC_ATTR_HOST_SEV_SNP would then be cleared, snp_perform_rmp_optimization()
> will simply return.
>

I think relying on CC_ATTR_HOST_SEV_SNP to skip the optimization should
be best, as long as there are no races (e.g. .free_inode() must
definitely not try to optimize while SNP is half shut down, or something
like that).

> Another option is to add a new guest_memfd superblock operation, and then do the
> final guest_memfd cleanup using the .evict_inode() callback. This will then ensure
> that the cleanup is not called through RCU and avoids any kind of delays, as following:
>
> +static void kvm_gmem_evict_inode(struct inode *inode)
> +{
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_CLEANUP
> +	kvm_arch_gmem_cleanup();
> +#endif
> +	truncate_inode_pages_final(&inode->i_data);
> +	clear_inode(inode);
> +}
> +
>

At the point of .evict_inode(), CoCo-shared guest_memfd pages could
still be pinned (for DMA or whatever, accidentally or maliciously). Can
rmpopt work on shared pages that might still be in use for DMA?

.invalidate_folio() and .free_folio() both actually happen on removal
from guest_memfd ownership, though neither is exactly the point at which
the folio is completely out of use.

Is the best time to optimize when the pages are truly freed?

> @@ -971,6 +979,7 @@ static const struct super_operations kvm_gmem_super_operations = {
>  	.alloc_inode	= kvm_gmem_alloc_inode,
>  	.destroy_inode	= kvm_gmem_destroy_inode,
>  	.free_inode	= kvm_gmem_free_inode,
> +	.evict_inode	= kvm_gmem_evict_inode,
>  };
>
>
> Thanks,
> Ashish
>
>>
>> IIUC the current kmem_cache_free(kvm_gmem_inode_cachep, GMEM_I(inode));
>> is fine because in kvm_gmem_exit(), there is a rcu_barrier() before
>> kmem_cache_destroy(kvm_gmem_inode_cachep);.
>>
>>> kmem_cache_free(kvm_gmem_inode_cachep, GMEM_I(inode));
>>> }
>>>
>>> --
>>> 2.43.0