Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings

From: Mike Rapoport
Date: Wed Oct 30 2019 - 04:40:17 EST


On Tue, Oct 29, 2019 at 10:00:55AM -0700, Andy Lutomirski wrote:
> On Tue, Oct 29, 2019 at 2:33 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> >
> > On Mon, Oct 28, 2019 at 02:44:23PM -0600, Andy Lutomirski wrote:
> > >
> > > > On Oct 27, 2019, at 4:17 AM, Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> > > >
> > > > ïFrom: Mike Rapoport <rppt@xxxxxxxxxxxxx>
> > > >
> > > > Hi,
> > > >
> > > > The patch below aims to allow applications to create mappins that have
> > > > pages visible only to the owning process. Such mappings could be used to
> > > > store secrets so that these secrets are not visible neither to other
> > > > processes nor to the kernel.
> > > >
> > > > I've only tested the basic functionality, the changes should be verified
> > > > against THP/migration/compaction. Yet, I'd appreciate early feedback.
> > >
> > > Iâve contemplated the concept a fair amount, and I think you should
> > > consider a change to the API. In particular, rather than having it be a
> > > MAP_ flag, make it a chardev. You can, at least at first, allow only
> > > MAP_SHARED, and admins can decide who gets to use it. It might also play
> > > better with the VM overall, and you wonât need a VM_ flag for it â you
> > > can just wire up .fault to do the right thing.
> >
> > I think mmap()/mprotect()/madvise() are the natural APIs for such
> > interface.
>
> Then you have a whole bunch of questions to answer. For example:
>
> What happens if you mprotect() or similar when the mapping is already
> in use in a way that's incompatible with MAP_EXCLUSIVE?

Then we refuse to mprotect()? Like in any other case when vm_flags are not
compatible with required madvise()/mprotect() operation.

> Is it actually reasonable to malloc() some memory and then make it exclusive?
>
> Are you permitted to map a file MAP_EXCLUSIVE? What does it mean?

I'd limit MAP_EXCLUSIVE only to anonymous memory.

> What does MAP_PRIVATE | MAP_EXCLUSIVE do?

My preference is to have only mmap() and then the semantics is more clear:

MAP_PRIVATE | MAP_EXCLUSIVE creates a pre-populated region, marks it locked
and drops the pages in this region from the direct map.
The pages are returned back on munmap().
Then there is no way to change an existing area to be exclusive or vice
versa.

> How does one pass exclusive memory via SCM_RIGHTS? (If it's a
> memfd-like or chardev interface, it's trivial. mmap(), not so much.)

Why passing such memory via SCM_RIGHTS would be useful?

> And finally, there's my personal giant pet peeve: a major use of this
> will be for virtualization. I suspect that a lot of people would like
> the majority of KVM guest memory to be unmapped from the host
> pagetables. But people might also like for guest memory to be
> unmapped in *QEMU's* pagetables, and mmap() is a basically worthless
> interface for this. Getting fd-backed memory into a guest will take
> some possibly major work in the kernel, but getting vma-backed memory
> into a guest without mapping it in the host user address space seems
> much, much worse.

Well, in my view, the MAP_EXCLUSIVE is intended to keep small secrets
rather than use it for the entire guest memory. I even considered adding a
limit for the mapping size, but then I decided that since RLIMIT_MEMLOCK is
anyway enforced there is no need for a new one.

I agree that getting fd-backed memory into a guest would be less pain that
VMA, but KVM can already use memory outside the control of the kernel via
/dev/map [1].

So unless I'm missing something here, there is no need to use MAP_EXCLUSIVE
for the guest memory.

[1] https://lwn.net/Articles/778240/

> > Switching to a chardev doesn't solve the major problem of direct
> > map fragmentation and defeats the ability to use exclusive memory mappings
> > with the existing allocators, while mprotect() and madvise() do not.
> >
>
> Will people really want to do malloc() and then remap it exclusive?
> This sounds dubiously useful at best.

Again, my preference is to have mmap() only, but I see a value in this use
case as well. Application developers allocate memory and then sometimes
change its properties rather than go mmap() something. For such usage
mprotect() may be usefull.


--
Sincerely yours,
Mike.