Re: [RFC PATCH kernel] iommufd: Allow mapping from KVM's guest_memfd

From: Sean Christopherson

Date: Thu Feb 26 2026 - 17:41:05 EST

On Thu, Feb 26, 2026, Jason Gunthorpe wrote:
> On Thu, Feb 26, 2026 at 12:19:52AM -0800, Ackerley Tng wrote:
> > Sean Christopherson <seanjc@xxxxxxxxxx> writes:
> >
> > > On Wed, Feb 25, 2026, Alexey Kardashevskiy wrote:
> > >> For the new guest_memfd type, no additional reference is taken as
> > >> pinning is guaranteed by the KVM guest_memfd library.
> > >>
> > >> There is no KVM-GMEMFD->IOMMUFD direct notification mechanism as
> > >> the assumption is that:
> > >> 1) page stage change events will be handled by VMM which is going
> > >> to call IOMMUFD to remap pages;
> > >> 2) shrinking GMEMFD equals to VM memory unplug and VMM is going to
> > >> handle it.
> > >
> > > The VMM is outside of the kernel's effective TCB. Assuming the VMM will always
> > > do the right thing is a non-starter.
> >
> > I think looking up the guest_memfd file from the userspace address
> > (uptr) is a good start
>
> Please no, if we need complicated things like notifiers then it is
> better to start directly with the struct file interface and get
> immediately into some guestmemfd API instead of trying to get their
> from a VMA. A VMA doesn't help in any way and just complicates things.

+1000. Anything that _requires_ a VMA to do something with guest_memfd is broken
by design.

> > I didn't think of this before LPC but forcing unmapping during
> > truncation (aka shrinking guest_memfd) is probably necessary for overall
> > system stability and correctness, so notifying and having guest_memfd
> > track where its pages were mapped in the IOMMU is necessary. Whether or
> > not to unmap during conversions could be a arch-specific thing, but all
> > architectures would want the memory unmapped if the memory is removed
> > from guest_memfd ownership.
>
> Things like truncate are a bit easier to handle, you do need a
> protective notifier, but if it detects truncate while an iommufd area
> still covers the truncated region it can just revoke the whole
> area. Userspace made a mistake and gets burned but the kernel is
> safe. We don't need something complicated kernel side to automatically
> handle removing just the slice of truncated guestmemfd, for example.

Yeah, as long as the behavior is well-documented from time zero, we can probably
get away with fairly draconian behavior.

> If guestmemfd is fully pinned and cannot free memory outside of
> truncate that may be good enough (though somehow I think that is not
> the case)

With in-place conversion, PUNCH_HOLE and private=>shared conversions are the only
two ways to partial "remove" memory from guest_memfd, so it may really be that
simple.