Re: [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range
From: Isaku Yamahata
Date: Tue Jun 20 2023 - 15:04:58 EST
On Tue, Jun 20, 2023 at 11:28:35AM -0500,
Michael Roth <michael.roth@xxxxxxx> wrote:
> On Thu, Jun 15, 2023 at 01:12:18PM -0700, isaku.yamahata@xxxxxxxxx wrote:
> > From: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> >
> > TDX and SEV-SNP need to know the reason for a callback by
> > kvm_unmap_gfn_range(). mmu notifier, set memory attributes ioctl or KVM
> > gmem callback. The callback handler changes the behavior or does the
> > additional housekeeping operation. For mmu notifier, it's zapping shared
> > PTE. For set memory attributes, it's the conversion of memory attributes
> > (private <=> shared). For KVM gmem, it's punching a hole in the range, and
> > releasing the file.
>
> I think it's still an open topic that we need to hear more from Sean about:
>
> https://lore.kernel.org/lkml/20230522235838.ov3722lcusotzlvo@xxxxxxx/
>
> but I *think* we were leaning toward decoupling the act of invalidating
> GFNs, vs. the act of invalidating/free'ing gmem pages.
>
> One concrete example of where this seperation makes sense if with
> hole-punching. SNP has unique platform-specific stuff it has to do before
> free'ing that gmem page back to the host. If we try to plumb this through
> kvm_unmap_gfn_range() via a special flag then it's a little awkward
> because:
>
> a) Presumably that hole-punch would have occurred after a preceeding
> KVM_SET_MEMORY_ATTRIBUTES was issued to switch the page to shared
> state in the xarray. So all it should really need to do is handle
> that platform-specific behavior, like updating RMP table in case of
> SNP. But none of the other details like GFN ranges are relevant in
> that case, RMP updates here only need the PFN, so we end up walking
> memslots to do GFN->PFN translations, when it would actually be much
> more efficient to do these translations by translating the
> hole-punched FD offset range to the corresponding folio()'s backing
> those ranges
>
> b) It places an unecessary dependency on having an active memslot to do
> those translations. This ends up not being an issue with current
> version of gmem patchset because the release() happens *before*
> gmem_unbind(), so there is a memslot associated with the ranges at
> gmem_release() time, but in the initial version of gmem it was the
> reverse, so if things ever changed again in this regard we'd once
> again have to completely rework how to issue these platform-specific
> invalidation callbacks.
>
> I really *really* like having a separate, simple invalidation mechanism
> in place that just converts FD offsets to PFNs and then passes those on
> to platform-defined handlers to clean up pages before free'ing them back
> to the system. It's versatile in that it can be called pretty much
> anywhere regardless of where we are in KVM lifecycle, it's robust in
> that it doesn't rely on unecessary outside dependencies, and it avoids
> added uneeded complexity to paths like kvm_unmap_gfn_range().
>
> That's the approach taken with SNP hypervisor v9 series, with the
> gmem hook being introduced here:
>
> https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@xxxxxxx/T/#m3ad8245235a27ed0f41c359c191dcda6c77af043
>
> and the SEV-SNP implementation of that hook being here:
>
> https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@xxxxxxx/T/#m6ac04b44722dbc07839011816e94fadf5ad6794e
>
> Would a similar approach work for TDX? At least WRT cleaning up pages
> before returning them back to the host? If we could isolate that
> requirement/handling from all the other aspects of invalidations it
> really seems like it would cause us less headaches down the road.
In short, TDX KVM doesn't need an extra callback for invalidating/freeing gmem
pages. kvm_unmap_gfn_range() callback works. Instead TDX needs attributes
(private-or-shared) for it.
The reason is, TDX uses encrypted page table for guest (We call it Secure-EPT),
and decicated operation on it. The TDX KVM mmu hooks TDP MMU operations.
In in the case of invalidating/releasing page, it eventually hooks
handle_removed_pt() for additional operation.
Because TDX simply won't use gmem_invalidate callback, I'm fine with it.
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > ---
> > include/linux/kvm_host.h | 11 ++++++++++-
> > virt/kvm/guest_mem.c | 10 +++++++---
> > virt/kvm/kvm_main.c | 4 +++-
> > 3 files changed, 20 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 1a47cedae8a1..c049c0aa44d6 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > #endif
> >
> > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> > +
> > +#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR BIT(0)
>
> Can you go into more detail on why special handling is needed for
> SET_MEM_ATTR?
When in TDX, the VMM zaps a private page from the encrypted page table and the
VMM adds the page back to the same GPA, it results in zeroing page and guest
needs to accept the page again. When converting a page from shared to private,
KVM needs to zap only shared pages. So the callback needs to know this zap
is for converting shared-to-private or private-to-shared.
> > +#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE BIT(1)
> > +#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE BIT(2)
>
> Would the need to distinguish between PUNCH_HOLE/RELEASE go away in the
> TDX case if you take the above approach? For SNP, the answer is yes. If
> that's also the case for TDX I think that would be another argument in
> favor of decoupling these from existing KVM MMU invalidation path.
TDX doesn't need gmem_invalidate callback. TDx doesn't need the difference
betwee punch hole and release. So in short TDX needs
KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR and KVM_GFN_RANGE_FLAGS_GMEM.
With KVM_GFN_RANGE_FLAGS_GMEM, the callback can know that invalidating private
pages. Maybe by (ab)using KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR(attr=shared),
KVM_GFN_RANGE_FLAGS_GMEM can be dropped.
Thanks,
Isaku Yamahata
--
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>