Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time
From: Vishal Annapurve
Date: Thu Mar 27 2025 - 04:15:05 EST
On Thu, Mar 13, 2025 at 11:17 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
> ...
> == Problem ==
>
> Currently, Dynamic Page Removal is being used when the TD is being
> shutdown for the sake of having simpler initial code.
>
> This happens when guest_memfds are closed, refer kvm_gmem_release().
> guest_memfds hold a reference to struct kvm, so that VM destruction cannot
> happen until after they are released, refer kvm_gmem_release().
>
> Reclaiming TD Pages in TD_TEARDOWN State was seen to decrease the total
> reclaim time. For example:
>
> VCPUs    Size (GB)    Before (secs)    After (secs)
>     4           18               72              24
>    32          107              517             134

If reclaim time grows linearly with memory size, then even the "After"
numbers are very high for TD cleanup (~21 minutes for a 1TB VM,
extrapolating from 134 secs for 107 GB).

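
For reference, the arithmetic behind the 1TB estimate can be sketched as
follows. This assumes reclaim time scales linearly with memory size (the two
quoted data points only roughly support that, and they also differ in VCPU
count) and takes 1TB as 1024 GB. The function name is purely illustrative.

```python
# Linear extrapolation of TD reclaim time from the measurements quoted above.
# Assumption: reclaim time scales linearly with guest memory size.

def extrapolate_secs(measured_gb, measured_secs, target_gb):
    """Scale a measured reclaim time to a different memory size."""
    return measured_secs * target_gb / measured_gb

# "After" data point from the table: 107 GB reclaimed in 134 secs.
after_1tb = extrapolate_secs(107, 134, 1024)
# "Before" data point from the table: 107 GB reclaimed in 517 secs.
before_1tb = extrapolate_secs(107, 517, 1024)

print(f"after:  {after_1tb / 60:.1f} min")
print(f"before: {before_1tb / 60:.1f} min")
```

Under the same assumption, the "Before" numbers would put a 1TB TD at well
over an hour.
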
>
> Note, the V19 patch set:
>
> https://lore.kernel.org/all/cover.1708933498.git.isaku.yamahata@xxxxxxxxx/
>
> did not have this issue because the HKID was released early, something that
> Sean effectively NAK'ed:
>
> "No, the right answer is to not release the HKID until the VM is
> destroyed."
>
> https://lore.kernel.org/all/ZN+1QHGa6ltpQxZn@xxxxxxxxxx/

IIUC, Sean is suggesting that S-EPT page removal and page reclaim be
treated separately. With his proposal:
1) If userspace drops the last reference on the gmem inode before/after
   dropping the VM reference
   -> slow S-EPT removal and slow page reclaim
2) If memslots are removed before closing the gmem and dropping the VM
   reference
   -> slow S-EPT page removal and no page reclaim as long as the gmem is
      around.

Reclaim should ideally happen when the host wants to use that memory,
i.e. in the following scenarios:
1) Truncation of private guest_memfd ranges
2) Conversion of private guest_memfd ranges to shared when supporting
   in-place conversion (this could also be deferred until the range is
   faulted in as shared).

Would it be possible for you to provide a breakdown of the time spent in
slow S-EPT page removal vs. page reclaim?

It might be worth exploring the possibility of parallelizing, or giving
userspace the flexibility to parallelize, both of these operations to
bring the cleanup time down (to be comparable with non-confidential VM
cleanup time, for example).
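
To illustrate why the breakdown matters for any parallelization effort, here
is a toy model. Everything in it is an assumption: the split values are
placeholders rather than measurements, both phases are assumed to divide
evenly across workers with no contention, and the 517 secs total is just the
"Before" figure for the 107 GB TD, used as an example input.

```python
# Toy model: how the removal/reclaim split bounds the gain from parallelism.
# Assumptions (not measured): each phase divides evenly across its workers
# with no contention; removal_fraction is a hypothetical placeholder.

def cleanup_secs(total_secs, removal_fraction, removal_workers, reclaim_workers):
    """Estimate cleanup time when each phase runs on its own worker count."""
    removal = total_secs * removal_fraction
    reclaim = total_secs - removal
    return removal / removal_workers + reclaim / reclaim_workers

TOTAL = 517  # secs, "Before" figure for the 107 GB TD in the table above

for fraction in (0.25, 0.5, 0.75):  # hypothetical splits
    # Only reclaim is parallelized (8 workers); S-EPT removal stays serial.
    t = cleanup_secs(TOTAL, fraction, removal_workers=1, reclaim_workers=8)
    print(f"removal={fraction:.0%}: {t:.1f} secs")
```

If S-EPT removal dominates, parallelizing only the reclaim side buys little,
so the split would tell us which of the two is worth attacking first.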