Re: [PATCH 1/6] KVM: guest_memfd: Add DEFAULT_SHARED flag, reject user page faults if not set

From: Sean Christopherson
Date: Wed Oct 01 2025 - 12:15:45 EST


On Wed, Oct 01, 2025, Vishal Annapurve wrote:
> On Mon, Sep 29, 2025 at 5:15 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > Oh! This got me looking at kvm_arch_supports_gmem_mmap() and thus
> > KVM_CAP_GUEST_MEMFD_MMAP. Two things:
> >
> > 1. We should change KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS so
> > that we don't need to add a capability every time a new flag comes along,
> > and so that userspace can gather all flags in a single ioctl. If gmem ever
> > supports more than 32 flags, we'll need KVM_CAP_GUEST_MEMFD_FLAGS2, but
> > that's a non-issue relatively speaking.
> >
>
> Guest_memfd capabilities don't necessarily translate into flags, so ideally:
> 1) There should be two caps, KVM_CAP_GUEST_MEMFD_FLAGS and
> KVM_CAP_GUEST_MEMFD_CAPS.

I'm not saying we can't have another GUEST_MEMFD capability or three; all I'm
saying is that for enumerating what flags can be passed to KVM_CREATE_GUEST_MEMFD,
KVM_CAP_GUEST_MEMFD_FLAGS is a better fit than a one-off KVM_CAP_GUEST_MEMFD_MMAP.
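
For illustration only, a minimal userspace-side sketch of how a flags capability
could be consumed. It assumes KVM_CHECK_EXTENSION on KVM_CAP_GUEST_MEMFD_FLAGS
returns a bitmask of the flags KVM_CREATE_GUEST_MEMFD accepts, and 0 (or an
error) if the capability is unknown. The EX_* names and bit values are
hypothetical, not real uAPI definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only; not the real uAPI values. */
#define EX_GMEM_FLAG_MMAP           (1u << 0)
#define EX_GMEM_FLAG_DEFAULT_SHARED (1u << 1)

/*
 * cap_ret models the value returned by KVM_CHECK_EXTENSION for the proposed
 * KVM_CAP_GUEST_MEMFD_FLAGS. Anything <= 0 is treated as "no flags supported";
 * a positive value is treated as a bitmask of supported flags.
 */
static bool ex_gmem_flags_supported(int cap_ret, uint32_t wanted)
{
	uint32_t supported = cap_ret > 0 ? (uint32_t)cap_ret : 0;

	/* Every requested bit must be present in the supported mask. */
	return (wanted & ~supported) == 0;
}
```

With this shape, a new flag only needs a new bit, and userspace learns about
all flags from a single KVM_CHECK_EXTENSION call.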

> 2) IMO they should both support namespace of 64 values at least from the get go.

The 32-value limit comes from KVM_CHECK_EXTENSION, and all of KVM's plumbing for ioctls.
Because KVM still supports 32-bit architectures, direct returns from ioctls are
forced to fit in 32-bit values to avoid unintentionally creating different ABI
for 32-bit vs. 64-bit kernels.
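
A rough model of the problem, not actual KVM code: if a 64-bit mask were
returned directly through a 32-bit return value, the upper bits would silently
disappear. The ex_* helpers are hypothetical and exist only for this sketch.

```c
#include <stdint.h>

/* Models a direct ioctl return that must fit in 32 bits. */
static int32_t ex_ioctl_return_32(uint64_t mask)
{
	/* Only the low 32 bits of the mask survive the return. */
	return (int32_t)(uint32_t)mask;
}

/* What userspace would observe after widening the 32-bit return value. */
static uint64_t ex_seen_by_userspace(uint64_t mask)
{
	return (uint32_t)ex_ioctl_return_32(mask);
}
```

E.g. a mask with bit 40 set in addition to bits 0 and 2 would come back as
just bits 0 and 2.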

We could add KVM_CHECK_EXTENSION2 or KVM_CHECK_EXTENSION64 or something, but I
honestly don't see the point. The odds of guest_memfd supporting >32 flags are
small, and the odds of that happening in the next ~5 years are basically zero.
All so that userspace can make one syscall instead of two for a path that isn't
remotely performance critical.
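
To make the "two syscalls" case concrete, a sketch of how userspace could
stitch the results together, assuming a future KVM_CAP_GUEST_MEMFD_FLAGS2
reports flags 32-63 the same way KVM_CAP_GUEST_MEMFD_FLAGS reports flags 0-31.
Neither the helper nor the FLAGS2 semantics exist today; both are assumptions.

```c
#include <stdint.h>

/*
 * lo_ret and hi_ret model the KVM_CHECK_EXTENSION return values for the
 * FLAGS and hypothetical FLAGS2 capabilities. Values <= 0 are treated as
 * "nothing supported" for that half.
 */
static uint64_t ex_combine_gmem_flags(int lo_ret, int hi_ret)
{
	uint64_t lo = lo_ret > 0 ? (uint32_t)lo_ret : 0;
	uint64_t hi = hi_ret > 0 ? (uint32_t)hi_ret : 0;

	/* FLAGS2 supplies the upper 32 bits of the combined mask. */
	return lo | (hi << 32);
}
```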

So while I agree that being able to enumerate 64 flags from the get-go would be
nice to have, it's simply not worth the effort (unless someone has a clever idea).

> 3) The reservation scheme for upstream should ideally be LSB's first
> for the new caps/flags.

We're getting way ahead of ourselves. Nothing needs KVM_CAP_GUEST_MEMFD_CAPS at
this time, so there's nothing to discuss.

> guest_memfd will achieve multiple features in future, both upstream
> and in out-of-tree versions to deploy features before they make their

When it comes to upstream uAPI and uABI, out-of-tree kernel code is irrelevant.

> way upstream. Generally the scheme followed by out-of-tree versions is
> to define a custom UAPI that won't conflict with upstream UAPIs in
> near future. Having a namespace of 32 values gives little space to
> avoid the conflict, e.g. features like hugetlb support will have to
> eat up at least 5 bits from the flags [1].

Why on earth would out-of-tree code use KVM_CAP_GUEST_MEMFD_FLAGS? Providing
infrastructure to support an infinite (quite literally) number of out-of-tree
capabilities and sub-ioctls, with practically zero chance of conflict, is not
difficult. See internal b/378111418.

But as above, this is not upstream's problem to solve.

> [1] https://elixir.bootlin.com/linux/v6.17/source/include/uapi/asm-generic/hugetlb_encode.h#L20