Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time

From: Adrian Hunter
Date: Thu Mar 27 2025 - 06:12:13 EST


On 27/03/25 10:14, Vishal Annapurve wrote:
> On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>> ...
>> == Problem ==
>>
>> Currently, Dynamic Page Removal is used when the TD is being shut down,
>> for the sake of having simpler initial code.
>>
>> This happens when guest_memfds are closed, refer kvm_gmem_release().
>> guest_memfds hold a reference to struct kvm, so that VM destruction cannot
>> happen until after they are released, refer kvm_gmem_release().
>>
>> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
>> reclaim time. For example:
>>
>> VCPUs  Size (GB)  Before (secs)  After (secs)
>>     4         18             72            24
>>    32        107            517           134
>
> If the time for reclaim grows linearly with memory size, then this is
> a significant amount of time for TD cleanup (~21 minutes for a 1TB VM).
>
>>
>> Note, the V19 patch set:
>>
>> https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@xxxxxxxxx/
>>
>> did not have this issue because the HKID was released early, something that
>> Sean effectively NAK'ed:
>>
>> "No, the right answer is to not release the HKID until the VM is
>> destroyed."
>>
>> https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@xxxxxxxxxx/
>
> IIUC, Sean is suggesting to treat S-EPT page removal and page reclaim
> separately. With his proposal:

Thanks for looking at this!

It seems I am using the term "reclaim" wrongly. Sorry!

I am talking about taking private memory away from the guest,
not what happens to it subsequently. When the TDX VM is in the "Runnable"
state, taking private memory away is slow (slow S-EPT removal).
When the TDX VM is in the "Teardown" state, taking private memory away
is faster (via a TDX SEAMCALL named TDH.PHYMEM.PAGE.RECLAIM, which is
where I picked up the term "reclaim").

Once guest memory is removed from the S-EPT, no further action is
needed to reclaim it. It belongs to KVM at that point.

guest_memfd memory can be added directly to the S-EPT; no intermediate
state or step is used. Any guest_memfd memory not given to the
MMU (S-EPT) can be freed directly if userspace/KVM wants to.
Again, there is no intermediate state or (reclaim) step.
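
To restate that in (deliberately simplified) code: the sketch below is a
self-contained, compilable illustration only, not KVM code, and every
identifier in it (td_state, dynamic_page_removal(), phymem_page_reclaim(),
take_page_from_guest()) is made up for the example.

/* td_remove_sketch.c - illustration only, not KVM/TDX code. */
#include <stdio.h>

enum td_state { TD_RUNNABLE, TD_TEARDOWN };

/* Stand-in for Dynamic Page Removal, the slow path used while the
 * TD is still Runnable. */
static void dynamic_page_removal(void)
{
        puts("slow: dynamic S-EPT removal");
}

/* Stand-in for TDH.PHYMEM.PAGE.RECLAIM, the cheap path once the TD
 * has entered Teardown. */
static void phymem_page_reclaim(void)
{
        puts("fast: TDH.PHYMEM.PAGE.RECLAIM");
}

/* The per-page cost of taking private memory away from the guest
 * depends only on the TD lifecycle state at the time of removal. */
static void take_page_from_guest(enum td_state state)
{
        if (state == TD_RUNNABLE)
                dynamic_page_removal();
        else
                phymem_page_reclaim();
}

int main(void)
{
        /* Today, kvm_gmem_release() runs before VM destruction, so
         * every page goes through the slow path. */
        take_page_from_guest(TD_RUNNABLE);

        /* With removal deferred until the TD is in Teardown, the same
         * page is taken back via the cheap path instead. */
        take_page_from_guest(TD_TEARDOWN);
        return 0;
}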

> 1) If userspace drops last reference on gmem inode before/after
> dropping the VM reference
> -> slow S-EPT removal and slow page reclaim

Currently, slow S-EPT removal happens when the file is released.

> 2) If memslots are removed before closing the gmem and dropping the VM reference
> -> slow S-EPT page removal and no page reclaim while the gmem is around.
>
> Reclaim should ideally happen when the host wants to use that memory,
> i.e. for the following scenarios:
> 1) Truncation of private guest_memfd ranges
> 2) Conversion of private guest_memfd ranges to shared when supporting
> in-place conversion (could also be deferred until the range is faulted
> in as shared).
>
> Would it be possible for you to provide the split of the time spent in
> slow S-EPT page removal vs page reclaim?

Based on what I wrote above, all the time is spent removing pages
from the S-EPT. Greater than 99% of shutdown time is spent in
kvm_gmem_release().
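
For scale, assuming the linear growth with memory size that you mention,
the per-GB rates from the table above can be extrapolated (back-of-the-
envelope arithmetic only, not a new measurement):

/* extrapolate.c - rough extrapolation from the 32-VCPU / 107 GB row. */
#include <stdio.h>

int main(void)
{
        const double size_gb   = 107.0;   /* measured TD size */
        const double before_s  = 517.0;   /* removal while Runnable */
        const double after_s   = 134.0;   /* reclaim in TD_TEARDOWN */
        const double target_gb = 1024.0;  /* a 1 TB guest */

        printf("before: ~%.0f min\n", before_s / size_gb * target_gb / 60.0);
        printf("after:  ~%.0f min\n", after_s / size_gb * target_gb / 60.0);
        return 0;
}

That comes out to roughly 82 minutes before and 21 minutes after for a
1 TB guest, which matches the ~21 minute figure you quoted.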

>
> It might be worth exploring the possibility of parallelizing or giving
> userspace the flexibility to parallelize both these operations to
> bring the cleanup time down (to be comparable with non-confidential VM
> cleanup time for example).