Re: [ANNOUNCE] PUCK Agenda - 2024.08.07 - KVM userfault (guest_memfd/HugeTLB postcopy)
From: Sean Christopherson
Date: Wed Aug 07 2024 - 20:17:58 EST

On Wed, Aug 07, 2024, James Houghton wrote:
> On Thu, Aug 1, 2024 at 3:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > Early warning for next week's PUCK since there's actually a topic this time.
> > James is going to lead a discussion on KVM userfault[*] (name subject to change).
>
> Thanks for attending, everyone!
>
> We seemed to arrive at the following conclusions:
>
> 1. For guest_memfd, stage 2 mapping installation will never go through
> GUP / virtual addresses to do the GFN --> PFN translation, including
> when it supports non-private memory.
> 2. Something like KVM Userfault is indeed necessary to handle
> post-copy for guest_memfd VMs, especially when guest_memfd supports
> non-private memory.
> 3. We should not hook into the overall GFN --> HVA translation, we
> should only be hooking the GFN --> PFN translation steps to figure out
> how to create stage 2 mappings. That is, KVM's own accesses to guest
> memory should just go through mm/userfaultfd.
> 4. We don't need the concept of "async userfaults" (making KVM block
> when attempting to access userfault memory) in KVM Userfault.
>
> So I need to think more about what exactly the API should look like
> for controlling if a page should exit to userspace before KVM is
> allowed to map it into stage 2 and if this should apply to all of
> guest memory or only guest_memfd.
>
> It sounds like it may most likely be something like a per-VM bitmap
> that describes which pages are allowed to be mapped into stage 2,
> applying to all memory, not just guest_memfd memory. Even though it is
> solving a problem for guest_memfd specifically, it is slightly cleaner
> to have it apply to all memory.
>
> If this per-VM bitmap applies to all memory, then we don't need to
> wait for guest_memfd to support non-private memory before working on a
> full implementation. But if not, perhaps it makes sense to wait.

Per-memslot likely makes more sense. Unlike attributes, the bitmap only needs
to exist during post-copy, and unless we do something clever, i.e. use something
other than a bitmap, the bitmap needs to be fully allocated, which would result
in unnecessary overhead if there are gaps in guest physical memory.

The other hiccup with a per-VM bitmap is that it would force us to define ABI
for things we don't care about. E.g. what happens if the local APIC is in-kernel
and userspace marks the APIC page as USERFAULT? Ditto for gfns without memslots.

E.g. add a KVM_MEM_USERFAULT flag along with a userfault_bitmap user pointer
that is valid when the flag is set. Unlike dirty logging, KVM is only a reader
of the bitmap, so I'm pretty sure we don't need a copy in KVM.

When userspace creates the VM on the target, it allocates a bitmap for each
memslot and sets KVM_MEM_USERFAULT. When migration completes, userspace clears
KVM_MEM_USERFAULT for each memslot, and then deletes the associated bitmap.
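
The per-memslot lifecycle described above could look roughly like the
following. This is a minimal, userspace-only sketch under stated assumptions,
not the KVM uAPI: only the names KVM_MEM_USERFAULT and userfault_bitmap come
from this thread, while the flag's bit value, struct sketch_memslot and every
sketch_* helper are hypothetical. The sketch also assumes that a set bit means
"exit to userspace before mapping this page into stage 2"; the thread does not
fix that polarity. A real implementation would have KVM read the bitmap
through the user pointer and would need atomic bit operations.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bit value: the thread proposes the flag name, not a number. */
#define SKETCH_MEM_USERFAULT (1u << 3)

#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for a memslot; not the kernel's struct. */
struct sketch_memslot {
	uint64_t base_gfn;
	uint64_t npages;
	uint32_t flags;
	/* One bit per page; only valid while SKETCH_MEM_USERFAULT is set. */
	unsigned long *userfault_bitmap;
};

/* Words needed to cover exactly this slot, so address-space gaps cost nothing. */
static size_t sketch_bitmap_words(uint64_t npages)
{
	return (size_t)((npages + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD);
}

/* Target-side setup: allocate the bitmap, mark every page, set the flag. */
static int sketch_enable_userfault(struct sketch_memslot *slot)
{
	size_t bytes = sketch_bitmap_words(slot->npages) * sizeof(unsigned long);

	slot->userfault_bitmap = malloc(bytes);
	if (!slot->userfault_bitmap)
		return -1;
	memset(slot->userfault_bitmap, 0xff, bytes);
	slot->flags |= SKETCH_MEM_USERFAULT;
	return 0;
}

/* Userspace clears a page's bit once its contents have been fetched. */
static void sketch_page_arrived(struct sketch_memslot *slot, uint64_t gfn)
{
	uint64_t idx;

	if (gfn < slot->base_gfn || gfn - slot->base_gfn >= slot->npages)
		return;
	idx = gfn - slot->base_gfn;
	slot->userfault_bitmap[idx / SKETCH_BITS_PER_WORD] &=
		~(1UL << (idx % SKETCH_BITS_PER_WORD));
}

/* Read-only check a fault path might make before creating a stage 2 mapping. */
static bool sketch_gfn_is_userfault(const struct sketch_memslot *slot, uint64_t gfn)
{
	uint64_t idx;

	if (!(slot->flags & SKETCH_MEM_USERFAULT))
		return false;
	if (gfn < slot->base_gfn || gfn - slot->base_gfn >= slot->npages)
		return false;
	idx = gfn - slot->base_gfn;
	return (slot->userfault_bitmap[idx / SKETCH_BITS_PER_WORD] &
		(1UL << (idx % SKETCH_BITS_PER_WORD))) != 0;
}

/* Migration complete: clear the flag first, then delete the bitmap. */
static void sketch_disable_userfault(struct sketch_memslot *slot)
{
	slot->flags &= ~SKETCH_MEM_USERFAULT;
	free(slot->userfault_bitmap);
	slot->userfault_bitmap = NULL;
}
```

Because the check is read-only and the bitmap is sized per slot, nothing in
this sketch has to define behavior for gfns that have no memslot; they simply
never reach sketch_gfn_is_userfault() with a matching slot.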