Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings

From: Dan Williams
Date: Tue Oct 29 2019 - 15:44:07 EST


On Mon, Oct 28, 2019 at 11:43 PM Kirill A. Shutemov
<kirill@xxxxxxxxxxxxx> wrote:
>
> On Mon, Oct 28, 2019 at 10:43:51PM -0700, Dan Williams wrote:
> > On Mon, Oct 28, 2019 at 6:16 AM Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Oct 28, 2019 at 02:00:19PM +0100, Mike Rapoport wrote:
> > > > On Mon, Oct 28, 2019 at 03:31:24PM +0300, Kirill A. Shutemov wrote:
> > > > > On Sun, Oct 27, 2019 at 12:17:32PM +0200, Mike Rapoport wrote:
> > > > > > From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
> > > > > >
> > > > > > The mappings created with MAP_EXCLUSIVE are visible only in the context of
> > > > > > the owning process and can be used by applications to store secret
> > > > > > information that will be visible neither to other processes nor to the
> > > > > > kernel.
> > > > > >
> > > > > > The pages in these mappings are removed from the kernel direct map and
> > > > > > marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> > > > > > the pages are mapped back into the direct map.
> > > > >
> > > > > I'm probably blind, but I don't see where you manipulate the direct map...
> > > >
> > > > __get_user_pages() calls __set_page_user_exclusive(), which in turn calls
> > > > set_direct_map_invalid_noflush(), which makes the page not present.
> > >
> > > Ah. okay.
> > >
> > > I think active use of this feature will lead to performance degradation of
> > > the system over time.
> > >
> > > Setting a single 4k page non-present in the direct mapping will require
> > > splitting the 2M or 1G page we usually use to map the direct mapping. And
> > > it's a one-way road: we don't have any mechanism to map the memory with a
> > > huge page again after the application has freed the page.
> > >
> > > It might be okay if all these pages cluster together, but I don't think we
> > > have a way to achieve it easily.
> >
> > Still, it would be worth exploring what that would look like, if not for
> > MAP_EXCLUSIVE then for set_mce_nospec(), which wants to punch poisoned
> > pages out of the direct map. In the case of pmem, where those pages can
> > be repaired, it would be nice to also repair the mapping granularity of
> > the direct map.
>
> The solution has to consist of two parts: finding a range to collapse and
> actually collapsing the range into a huge page.
>
> Finding the collapsible range will likely require background scanning of
> the direct mapping, as we do for THP with khugepaged. That should not be
> too hard, but it will likely require long and tedious tuning to be
> effective without being too disruptive to the system.
>
> Alternatively, after any change to the direct mapping, we could check
> whether the surrounding range is collapsible: up to 1G around the changed
> 4k page. That might be more taxing than scanning if the direct mapping
> changes often.
>
> Collapsing itself appears to be simple: re-check under the lock that the
> range is collapsible, replace the page table with the huge page entry, and
> flush the TLB.
>
> But some CPUs don't like having two TLB entries for the same memory with
> different sizes at the same time. See, for instance, AMD erratum 383.

That basic description would seem to defeat most (all?) interesting
huge page use cases. For example, dax makes no attempt to ensure that
aliased mappings of pmem are the same size between the direct map the
driver uses and userspace dax mappings. So I assume there are more
details than "all aliased mappings must be the same size".

> Getting it right requires making the range not present, flushing the TLB,
> and only then installing the huge page. That's what we do for userspace.
>
> That will not fly for the direct mapping. There is no reasonable way to
> exclude other CPUs from accessing the range while it is not present (call
> stop_machine()? :P). Moreover, the range may contain the code that is
> doing the collapse, or data required for it...

At least for pmem, all the access points can be controlled. pmem is
never used for kernel text, at least in the dax mode, where it is
accessed via file-backed shared mappings or the pmem driver. So when
I say "direct-map repair" I mean the incidental direct map that pmem
uses, since it maps pmem with arch_add_memory(), not the typical DRAM
direct map that may house kernel text. Poison consumed from the kernel
DRAM direct map is fatal; poison consumed from dax mappings or the
pmem driver path is recoverable and repairable.