Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings

From: Sean Christopherson
Date: Mon Oct 28 2019 - 13:34:14 EST

On Mon, Oct 28, 2019 at 10:12:44AM -0700, Dave Hansen wrote:
> On 10/27/19 3:17 AM, Mike Rapoport wrote:
> > The pages in these mappings are removed from the kernel direct map and
> > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > the pages are mapped back into the direct map.
>
> This looks fun. It's certainly simple.
>
> But, the description is not really calling out the pros and cons very
> well. I'm also not sure that folks will use an interface like this that
> requires up-front, special code to do an allocation instead of something
> like madvise(). That's why protection keys ended up the way they did: if
> you do this as a mmap() replacement, you need to modify all *allocators*
> to be enabled for this. If you do it with mprotect()-style, you can
> apply it to existing allocations.
>
> Some other random thoughts:
> * The page flag is probably not a good idea. It would probably be
> better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
> into the slow path.
> * This really stops being "normal" memory. You can't do futexes on it,
> can't splice it. Probably need a more fleshed-out list of
> incompatible features.
> * As Kirill noted, each 4k page ends up with a potential 1GB "blast
> radius" of demoted pages in the direct map. Not cool. This is
> probably a non-starter as it stands.
> * The global TLB flushes are going to eat you alive. They probably
> border on a DoS on larger systems.
> * Do we really want this user interface to dictate the kernel
> implementation? In other words, do we really want MAP_EXCLUSIVE,
> or do we want MAP_SECRET? One tells the kernel what to *do*, the
> other tells the kernel what the memory *IS*.

If we go that route, maybe MAP_USER_SECRET so that there's wiggle room in
the event that there are different secret keepers that require different
implementations in the kernel? E.g. MAP_GUEST_SECRET for a KVM guest to
take the userspace VMM (QEMU) out of the TCB, i.e. the mapping would be
accessible by the kernel (or just KVM?) and the KVM guest, but not
userspace.

> * There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
> Persistent Memory, where the kernel direct map is a liability in some
> way. We probably need some kind of overall, architected solution
> rather than five or ten things all poking at the direct map.