Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings

From: Kirill A. Shutemov
Date: Tue Oct 29 2019 - 02:43:22 EST

On Mon, Oct 28, 2019 at 10:43:51PM -0700, Dan Williams wrote:
> On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> > > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > > > From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
> > > > >
> > > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > > > the owning process and can be used by applications to store secret
> > > > > > information that is hidden not only from other processes but from the
> > > > > > kernel as well.
> > > > >
> > > > > The pages in these mappings are removed from the kernel direct map and
> > > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > > > the pages are mapped back into the direct map.
> > > >
> > > > I'm probably blind, but I don't see where you manipulate the direct map...
> > >
> > > __get_user_pages() calls __set_page_user_exclusive() which in turn calls
> > > set_direct_map_invalid_noflush() that makes the page not present.
> >
> > Ah. okay.
> >
> > I think active use of this feature will lead to performance degradation of
> > the system with time.
> >
> > Setting a single 4k page non-present in the direct mapping will require
> > splitting the 2M or 1G page we usually map the direct mapping with. And
> > it's a one-way road. We don't have any mechanism to map the memory with a
> > huge page again after the application has freed the page.
> >
> > It might be okay if all these pages cluster together, but I don't think we
> > have a way to achieve it easily.
> Still, it would be worth exploring what that would look like, if not
> for MAP_EXCLUSIVE then for set_mce_nospec(), which wants to punch out
> poison pages from the direct map. In the case of pmem, where those pages
> are able to be repaired, it would be nice to also repair the mapping
> granularity of the direct map.

The solution has to consist of two parts: finding a range to collapse and
actually collapsing the range into a huge page.

Finding the collapsible range will likely require background scanning of
the direct mapping as we do for THP with khugepaged. It should not be too
hard, but will likely require long and tedious tuning to be effective
without being too disruptive to the system.

Alternatively, after any change to the direct mapping, we can check
whether the surrounding range is collapsible, up to 1G around the changed
4k page. It might be more taxing than scanning if the direct mapping
changes often.
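
To give a feel for how much there is to check, here is a rough userspace
sketch of the arithmetic. None of this is kernel code: struct range,
aligned_window() and ptes_in_window() are names made up for illustration,
and the sizes assume x86-64 with 4k base pages.

```c
#include <assert.h>
#include <stdint.h>

/* Page sizes assumed for x86-64; illustrative constants, not kernel macros. */
#define MODEL_SZ_4K (UINT64_C(1) << 12)
#define MODEL_SZ_2M (UINT64_C(1) << 21)
#define MODEL_SZ_1G (UINT64_C(1) << 30)

/* Half-open address range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Naturally aligned window of 'size' bytes that contains 'addr'. */
static struct range aligned_window(uint64_t addr, uint64_t size)
{
	struct range r;

	/* The mask trick below only works for power-of-two sizes. */
	assert(size != 0 && (size & (size - 1)) == 0);

	r.start = addr & ~(size - 1);
	r.end = r.start + size;
	return r;
}

/* Number of 4k entries to inspect to validate one window. */
static uint64_t ptes_in_window(uint64_t size)
{
	return size / MODEL_SZ_4K;
}
```

For a single changed 4k page that is 512 entries for the 2M window and
262144 for the 1G window, which is why doing it on every change could get
expensive.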

Collapsing itself appears to be simple: re-check if the range is
collapsible under the lock, replace the page table with the huge page and
flush the TLB.
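
The re-check could look roughly like the sketch below. It is a userspace
model under my own assumptions, not what pageattr.c does: struct model_pte,
MODEL_PRESENT, MODEL_RW, fill_linear() and range_is_collapsible() are all
hypothetical, and a real check would also have to run under the lock and
look at the actual page table entry format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PTRS_PER_TABLE 512		/* 4k entries per 2M range */
#define MODEL_PRESENT (UINT64_C(1) << 0)	/* hypothetical flag bits */
#define MODEL_RW      (UINT64_C(1) << 1)

/* Simplified stand-in for a page table entry. */
struct model_pte {
	uint64_t pfn;	/* page frame number the entry points to */
	uint64_t prot;	/* protection and attribute bits */
};

/* Fill a table so that it maps 512 consecutive frames with one protection. */
static void fill_linear(struct model_pte *table, uint64_t base_pfn,
			uint64_t prot)
{
	assert(table != NULL);
	for (size_t i = 0; i < MODEL_PTRS_PER_TABLE; i++) {
		table[i].pfn = base_pfn + i;
		table[i].prot = prot;
	}
}

/*
 * A table can be replaced by one huge entry only if every entry is
 * present, all entries carry identical protection bits, and the frames
 * are physically contiguous starting at a huge-page-aligned frame.
 */
static bool range_is_collapsible(const struct model_pte *table)
{
	uint64_t base, prot;

	assert(table != NULL);
	base = table[0].pfn;
	prot = table[0].prot;

	if (!(prot & MODEL_PRESENT))
		return false;
	if (base % MODEL_PTRS_PER_TABLE != 0)
		return false;

	for (size_t i = 1; i < MODEL_PTRS_PER_TABLE; i++) {
		if (table[i].prot != prot)
			return false;
		if (table[i].pfn != base + i)
			return false;
	}
	return true;
}
```

A single non-present 4k entry, which is exactly what MAP_EXCLUSIVE leaves
behind while the mapping is alive, is enough to make the check fail.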

But some CPUs don't like to have two TLB entries for the same memory with
different sizes at the same time. See for instance AMD erratum 383.

Getting it right would require making the range not present, flushing the
TLB and only then installing the huge page. That's what we do for userspace.
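
The difference between the two orderings can be shown with a toy model.
This is only a sketch: struct model_cpu and the helpers below are invented
for illustration, a "TLB" here is just two booleans for one CPU, and the
model says nothing about what real hardware does when the conflict occurs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the page tables currently say about the range. */
enum model_mapping { MODEL_MAP_NONE, MODEL_MAP_4K, MODEL_MAP_2M };

/* Toy per-CPU TLB: which translation sizes are cached for the range. */
struct model_cpu {
	bool tlb_has_4k;
	bool tlb_has_2m;
};

/* An access caches a translation of whatever size is currently mapped. */
static void cpu_access(struct model_cpu *c, enum model_mapping m)
{
	assert(c != NULL);
	if (m == MODEL_MAP_4K)
		c->tlb_has_4k = true;
	else if (m == MODEL_MAP_2M)
		c->tlb_has_2m = true;
	/* MODEL_MAP_NONE: the access would fault, nothing gets cached. */
}

static void tlb_flush(struct model_cpu *c)
{
	c->tlb_has_4k = false;
	c->tlb_has_2m = false;
}

/* Both sizes cached for the same memory: the situation to avoid. */
static bool tlb_conflict(const struct model_cpu *c)
{
	return c->tlb_has_4k && c->tlb_has_2m;
}

/* Replace the page table with the huge page first, flush afterwards. */
static bool collapse_in_place(struct model_cpu *c)
{
	bool conflict;

	cpu_access(c, MODEL_MAP_4K);	/* stale 4k entry is cached */
	cpu_access(c, MODEL_MAP_2M);	/* access lands before the flush */
	conflict = tlb_conflict(c);
	tlb_flush(c);
	return conflict;
}

/* Make the range not present, flush, and only then install the huge page. */
static bool collapse_break_before_make(struct model_cpu *c)
{
	bool conflict;

	cpu_access(c, MODEL_MAP_4K);	/* stale 4k entry is cached */
	tlb_flush(c);			/* done after clearing the entry */
	cpu_access(c, MODEL_MAP_NONE);	/* faults in the not-present window */
	conflict = tlb_conflict(c);
	cpu_access(c, MODEL_MAP_2M);	/* huge page installed */
	return conflict || tlb_conflict(c);
}
```

In this model collapse_in_place() reports a conflict and
collapse_break_before_make() does not, but the second one only works
because the access in the not-present window is allowed to fault.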

It will not fly for the direct mapping. There is no reasonable way to
exclude other CPUs from accessing the range while it's not present (call
stop_machine()? :P). Moreover, the range may contain the code that is doing
the collapse or the data required for it...

BTW, it looks like the current __split_large_page() in pageattr.c is
susceptible to the erratum. Maybe we can get away with the easy way...

Kirill A. Shutemov