Re: [RFC PATCH 0/5] KVM: guest_memfd: support for uffd missing
From: James Houghton
Date: Wed Mar 05 2025 - 14:36:21 EST
On Mon, Mar 3, 2025 at 1:29 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, Mar 03, 2025 at 01:30:06PM +0000, Nikita Kalyazin wrote:
> > This series is built on top of the v3 write syscall support [1].
> >
> > With James's KVM userfault [2], it is possible to handle stage-2 faults
> > in guest_memfd in userspace. However, KVM itself also triggers faults
> > in guest_memfd in some cases, for example: PV interfaces like kvmclock,
> > PV EOI and page table walking code when fetching the MMIO instruction on
> > x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
> > that KVM would be accessing those pages via userspace page tables. In
> > order for such faults to be handled in userspace, guest_memfd needs to
> > support userfaultfd.
> >
> > This series proposes a limited support for userfaultfd in guest_memfd:
> > - userfaultfd support is conditional to `CONFIG_KVM_GMEM_SHARED_MEM`
> > (as is fault support in general)
> > - Only `page missing` event is currently supported
> > - Userspace is supposed to respond to the event with the `write`
> > syscall followed by `UFFDIO_CONTINUE` ioctl to unblock the faulting
> > process. Note that we can't use `UFFDIO_COPY` here because
> > userfaulfd code does not know how to prepare guest_memfd pages, eg
> > remove them from direct map [4].
> >
> > Not included in this series:
> > - Proper interface for userfaultfd to recognise guest_memfd mappings
> > - Proper handling of truncation cases after locking the page
> >
> > Request for comments:
> > - Is it a sensible workflow for guest_memfd to resolve a userfault
> > `page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One
> > of the alternatives is teaching `UFFDIO_COPY` how to deal with
> > guest_memfd pages.
>
> Probably not.. I don't see what protects a thread fault concurrently
> during write() happening, seeing partial data. Since you check the page
> cache it'll let it pass, but the partial page will be faulted in there.
+1 here.
I think the simplest way to make it work would be to also check
folio_test_uptodate() in the userfaultfd_missing() check[1]. It would
pair with kvm_gmem_mark_prepared() in the write() path[2].
I'm not sure if that's the "right" way, I think it would prevent
threads from reading data as it is written.
[1]: https://lore.kernel.org/kvm/20250303133011.44095-3-kalyazin@xxxxxxxxxx/
[2]: https://lore.kernel.org/kvm/20250303130838.28812-2-kalyazin@xxxxxxxxxx/
> I think we may need to either go with full MISSING or full MINOR traps.
I agree, and just to clarify: you've basically implemented the MISSING
model, just using write() to resolve userfaults instead of
UFFDIO_COPY. The UFFDIO_CONTINUE implementation you have isn't really
doing much; when the page cache has a page, the fault handler will
populate the PTE for you.
I think it's probably simpler to implement the MINOR model, where
userspace can populate the page cache however it wants; write() is
perfectly fine/natural. UFFDIO_CONTINUE just needs to populate PTEs
for gmem, and the fault handler needs to check for the presence of
PTEs. The `struct vm_fault` you have should contain enough info.
> One thing to mention is we probably need MINOR sooner or later to support
> gmem huge pages. The thing is for huge folios in gmem we can't rely on
> missing in page cache, as we always need to allocate in hugetlb sizes.
>
> > - What is a way forward to make userfaultfd code aware of guest_memfd?
> > I saw that Patrick hit a somewhat similar problem in [5] when trying
> > to use direct map manipulation functions in KVM and was pointed by
> > David at Elliot's guestmem library [6] that might include a shim for that.
> > Would the library be the right place to expose required interfaces like
> > `vma_is_gmem`?
>
> Not sure what's the best to do, but IIUC the current way this series uses
> may not work as long as one tries to reference a kvm symbol from core mm..
>
> One trick I used so far is leveraging vm_ops and provide hook function to
> report specialties when it's gmem. In general, I did not yet dare to
> overload vm_area_struct, but I'm thinking maybe vm_ops is more possible to
> be accepted. E.g. something like this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5e742738240c..b068bb79fdbc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -653,8 +653,26 @@ struct vm_operations_struct {
> */
> struct page *(*find_special_page)(struct vm_area_struct *vma,
> unsigned long addr);
> + /*
> + * When set, return the allowed orders bitmask in faults of mmap()
> + * ranges (e.g. for follow up huge_fault() processing). Drivers
> + * can use this to bypass THP setups for specific types of VMAs.
> + */
> + unsigned long (*get_supported_orders)(struct vm_area_struct *vma);
> };
>
> +static inline bool vma_has_supported_orders(struct vm_area_struct *vma)
> +{
> + return vma->vm_ops && vma->vm_ops->get_supported_orders;
> +}
> +
> +static inline unsigned long vma_get_supported_orders(struct vm_area_struct *vma)
> +{
> + if (!vma_has_supported_orders(vma))
> + return 0;
> + return vma->vm_ops->get_supported_orders(vma);
> +}
> +
>
> In my case I used that to allow gmem report huge page supports on faults.
>
> Said that, above only existed in my own tree so far, so I also don't know
> whether something like that could be accepted (even if it'll work for you).
I think it might be useful to implement an fs-generic MINOR mode. The
fault handler is already easy enough to do generically (though it
would become more difficult to determine if the "MINOR" fault is
actually a MISSING fault, but at least for my userspace, the
distinction isn't important. :)) So the question becomes: what should
UFFDIO_CONTINUE look like?
And I think it would be nice if UFFDIO_CONTINUE just called
vm_ops->fault() to get the page we want to map and then mapped it,
instead of having shmem-specific and hugetlb-specific versions (though
maybe we need to keep the hugetlb specialization...). That would avoid
putting kvm/gmem/etc. symbols in mm/userfaultfd code.
I've actually wanted to do this for a while but haven't had a good
reason to pursue it. I wonder if it can be done in a
backwards-compatible fashion...