Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

From: Fuad Tabba
Date: Fri Sep 23 2022 - 11:21:17 EST


Hi,

<...>

> > Regarding pKVM's use case, with the shim approach I believe this can be done by
> > allowing userspace to mmap() the "hidden" memfd, but with a ton of restrictions
> > piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM could
> > tightly control usage without taking on too much complexity in the kernel, but
> > working through things, routing the behavior through the shim itself might not
> > be all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private" fd
> > because things got too complex, but with the shim it doesn't seem _that_ bad.
>
> What's the exact use case? Is it just to pre-populate the memory?

Prepopulate memory and access memory that could go back and forth from
being shared to being private.

Cheers,
/fuad



> >
> > E.g. on the memfd side:
> >
> > 1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
> > mapping is all or nothing.
> >
> > 2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
> > the restricted memfd.
> >
> > 3. Add notifier hooks to allow downstream users to further restrict things.
> >
> > 4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
> > one shot.
> >
> > 5. Require that there are no outstanding references at munmap(). Or if this
> > can't be guaranteed by userspace, maybe add some way for userspace to wait
> > until it's ok to convert to private? E.g. so that get_pfn() doesn't need
> > to do an expensive check every time.
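
For illustration only, rules 1, 2 and 4 above can be modelled in plain
userspace C. This is a rough sketch, not kernel code: the struct and the
model_*() helpers are hypothetical names invented here, and rules 3 and 5
(notifier hooks, outstanding references) are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct restricted_memfd {
	size_t size;	/* size of the file in bytes */
	bool mapped;	/* true while the single allowed mapping exists */
};

/* Rule 1: the mapping must cover the whole file, and only one may exist. */
static int model_mmap(struct restricted_memfd *f, size_t offset, size_t len)
{
	assert(f != NULL);
	if (offset != 0 || len != f->size)
		return -EINVAL;
	if (f->mapped)
		return -EBUSY;
	f->mapped = true;
	return 0;
}

/* Rule 2: get_pfn() is refused while a userspace mapping exists. */
static int model_get_pfn(const struct restricted_memfd *f)
{
	assert(f != NULL);
	return f->mapped ? -EBUSY : 0;
}

/* Rule 4: no VMA splitting, so munmap() must drop everything in one shot. */
static int model_munmap(struct restricted_memfd *f, size_t offset, size_t len)
{
	assert(f != NULL);
	if (!f->mapped || offset != 0 || len != f->size)
		return -EINVAL;
	f->mapped = false;
	return 0;
}
```

In this model a partial mmap() or munmap() fails with -EINVAL, a second
mmap() fails with -EBUSY, and get_pfn() only succeeds once the mapping is
gone.
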
>
> Hmm. I haven't looked at the code to see if this would really work, but I
> think this could be done more in line with how the rest of the kernel works
> by using the rmap infrastructure. When the pKVM memfd is in not-yet-private
> mode, just let it be mmapped as usual (but don't allow any form of GUP or
> pinning). Then have an ioctl to switch to shared mode that takes locks or
> sets flags so that no new faults can be serviced and does
> unmap_mapping_range.
>
> As long as the shim arranges to have its own vm_ops, I don't immediately see any reason this can't work.
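
As a rough illustration of the ordering described above (block new faults
first, then zap the existing mappings), here is a small userspace model in
C. It is only a sketch under assumptions: struct shim_file and the
model_*() helpers are hypothetical, the counter stands in for pages present
in user page tables, and the real work would be done by the shim's own
vm_ops and unmap_mapping_range().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; these names are illustrative, not kernel API. */
struct shim_file {
	bool faults_blocked;		/* set by the conversion ioctl */
	unsigned int mapped_pages;	/* pages present in user page tables */
};

/* Stand-in for the shim's fault handler in its own vm_ops. */
static int model_fault(struct shim_file *f)
{
	assert(f != NULL);
	if (f->faults_blocked)
		return -EFAULT;	/* no new faults once converted */
	f->mapped_pages++;
	return 0;
}

/* Stand-in for GUP/pinning on this file: always refused. */
static int model_pin(const struct shim_file *f)
{
	assert(f != NULL);
	return -EOPNOTSUPP;
}

/* Stand-in for the conversion ioctl. */
static int model_convert(struct shim_file *f)
{
	assert(f != NULL);
	if (f->faults_blocked)
		return -EALREADY;
	f->faults_blocked = true;	/* step 1: stop servicing faults */
	f->mapped_pages = 0;		/* step 2: models unmap_mapping_range() */
	return 0;
}
```

Setting the flag before zapping is the point of the model: if the order
were reversed, a fault racing with the conversion could repopulate a page
after the unmap.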