Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

From: Andy Lutomirski
Date: Wed Sep 01 2021 - 13:10:22 EST




On Wed, Sep 1, 2021, at 9:16 AM, David Hildenbrand wrote:
> On 01.09.21 17:54, Andy Lutomirski wrote:
> > On Wed, Sep 1, 2021, at 1:09 AM, David Hildenbrand wrote:
> >>>> Do we have to protect from that? How would KVM protect from user space
> >>>> replacing private pages by shared pages in any of the models we discuss?
> >>>
> >>> The overarching rule is that KVM needs to guarantee a given pfn is never mapped[*]
> >>> as both private and shared, where "shared" also incorporates any mapping from the
> >>> host. Essentially it boils down to the kernel ensuring that a pfn is unmapped
> >>> before it's converted to/from private, and KVM ensuring that it honors any
> >>> unmap notifications from the kernel, e.g. via mmu_notifier or via a direct callback
> >>> as proposed in this RFC.
> >>
> >> Okay, so the fallocate(PUNCHHOLE) from user space could trigger the
> >> respective unmapping and freeing of backing storage.
> >>
> >>>
> >>> As it pertains to PUNCH_HOLE, the responsibilities are no different than when the
> >>> backing-store is destroyed; the backing-store needs to notify downstream MMUs
> >>> (a.k.a. KVM) to unmap the pfn(s) before freeing the associated memory.
> >>
> >> Right.
> >>
> >>>
> >>> [*] Whether or not the kernel's direct mapping needs to be removed is debatable,
> >>> but my argument is that that behavior is not visible to userspace and thus
> >>> out of scope for this discussion, e.g. zapping/restoring the direct map can
> >>> be added/removed without impacting the userspace ABI.
> >>
> >> Right. Removing it shouldn't also be requited IMHO. There are other ways
> >> to teach the kernel to not read/write some online pages (filter
> >> /proc/kcore, disable hibernation, strict access checks for /dev/mem ...).
> >>
> >>>
> >>>>>> Define "ordinary" user memory slots as overlay on top of "encrypted" memory
> >>>>>> slots. Inside KVM, bail out if you encounter such a VMA inside a normal
> >>>>>> user memory slot. When creating a "encryped" user memory slot, require that
> >>>>>> the whole VMA is covered at creation time. You know the VMA can't change
> >>>>>> later.
> >>>>>
> >>>>> This can work for the basic use cases, but even then I'd strongly prefer not to
> >>>>> tie memslot correctness to the VMAs. KVM doesn't truly care what lies behind
> >>>>> the virtual address of a memslot, and when it does care, it tends to do poorly,
> >>>>> e.g. see the whole PFNMAP snafu. KVM cares about the pfn<->gfn mappings, and
> >>>>> that's reflected in the infrastructure. E.g. KVM relies on the mmu_notifiers
> >>>>> to handle mprotect()/munmap()/etc...
> >>>>
> >>>> Right, and for the existing use cases this worked. But encrypted memory
> >>>> breaks many assumptions we once made ...
> >>>>
> >>>> I have somewhat mixed feelings about pages that are mapped into $WHATEVER
> >>>> page tables but not actually mapped into user space page tables. There is no
> >>>> way to reach these via the rmap.
> >>>>
> >>>> We have something like that already via vfio. And that is fundamentally
> >>>> broken when it comes to mmu notifiers, page pinning, page migration, ...
> >>>
> >>> I'm not super familiar with VFIO internals, but the idea with the fd-based
> >>> approach is that the backing-store would be in direct communication with KVM and
> >>> would handle those operations through that direct channel.
> >>
> >> Right. The problem I am seeing is that e.g., try_to_unmap() might not be
> >> able to actually fully unmap a page, because some non-synchronized KVM
> >> MMU still maps a page. It would be great to evaluate how the fd
> >> callbacks would fit into the whole picture, including the current rmap.
> >>
> >> I guess I'm missing the bigger picture how it all fits together on the
> >> !KVM side.
> >
> > The big picture is fundamentally a bit nasty. Logically (ignoring the implementation details of rmap, mmu_notifier, etc), you can call try_to_unmap and end up with a page that is Just A Bunch Of Bytes (tm). Then you can write it to disk, memcpy to another node, compress it, etc. When it gets faulted back in, you make sure the same bytes end up somewhere and put the PTEs back.
> >
> > With guest-private memory, this doesn't work. Forget about the implementation: you simply can't take a page of private memory, quiesce it so the guest can't access it without faulting, and turn it into Just A Bunch Of Bytes. TDX does not have that capability. (Any system with integrity-protected memory won't without having >PAGE_SIZE bytes or otherwise storing metadata, but TDX can't do this at all.) SEV-ES *can* (I think -- I asked the lead architect), but that's not the same thing as saying it's a good idea.
> >
> > So if you want to migrate a TDX page from one NUMA node to another, you need to do something different (I don't know all the details), you will have to ask Intel to explain how this might work in the future (it wasn't in the public docs last time I looked), but I'm fairly confident that it does not resemble try_to_unmap().
> >
> > Even on SEV, where a page *can* be transformed into a Just A Bunch Of Bytes, the operation doesn't really look like try_to_unmap(). As I understand it, it's more of:
> >
> > look up the one-and-only guest and GPA at which this page is mapped.
> > tear down the NPT PTE. (SEV, unlike TDX, uses paging structures in normal memory.)
> > Ask the magic SEV mechanism to turn the page into swappable data
> > Go to town.
> >
> > This doesn't really resemble the current rmap + mmu_notifier code, and shoehorning this into mmu_notifier seems like it may be uglier and more code than implementing it more directly in the backing store.
> >
> > If you actually just try_to_unmap() a SEV-ES page (and it worked, which it currently won't), you will have data corruption or cache incoherency.
> >
> > If you want to swap a page on TDX, you can't. Sorry, go directly to jail, do not collect $200.
> >
> > So I think there are literally zero code paths that currently call try_to_unmap() that will actually work like that on TDX. If we run out of memory on a TDX host, we can kill the guest completely and reclaim all of its memory (which probably also involves killing QEMU or whatever other user program is in charge), but that's really our only option.
>
> try_to_unmap() would actually do the right thing I think. It's the users
> that access page content (migration, swapping) that need additional care
> to special-case these pages. Completely agree that these would be broken.

I suspect that the eventual TDX numa migration flow will involve asking the TD module to relocate the page without unmapping it first. TDX doesn’t really have a nondestructive unmap operation.

>
>
> --
> Thanks,
>
> David / dhildenb
>
>