Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
From: Huang, Kai
Date: Wed Dec 21 2022 - 19:37:43 EST
On Wed, 2022-12-21 at 21:39 +0800, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > [...]
> > > > > > > > > > > >
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > + /*
> > > > > > > > > > > > > > + * These pages are currently unmovable so don't place them into
> > > > > > > > > > > > > > movable
> > > > > > > > > > > > > > + * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > + mapping = memfd->f_mapping;
> > > > > > > > > > > > > > + mapping_set_unevictable(mapping);
> > > > > > > > > > > > > > + mapping_set_gfp_mask(mapping,
> > > > > > > > > > > > > > + mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > > > > > > > > > >
> > > > > > > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > > > > > > > > > > non-
> > > > > > > > > > > > movable zones, but doesn't necessarily prevent page from being migrated. My
> > > > > > > > > > > > first glance is you need to implement either a_ops->migrate_folio() or just
> > > > > > > > > > > > get_page() after faulting in the page to prevent.
> > > > > > > > > >
> > > > > > > > > > The current api restrictedmem_get_page() already does this, after the
> > > > > > > > > > caller calling it, it holds a reference to the page. The caller then
> > > > > > > > > > decides when to call put_page() appropriately.
> > > > > > > >
> > > > > > > > I tried to dig some history. Perhaps I am missing something, but it seems Kirill
> > > > > > > > said in v9 that this code doesn't prevent page migration, and we need to
> > > > > > > > increase page refcount in restrictedmem_get_page():
> > > > > > > >
> > > > > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@xxxxxxxxxxxxxxxxx/
> > > > > > > >
> > > > > > > > But looking at this series it seems restrictedmem_get_page() in this v10 is
> > > > > > > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
> > > > > >
> > > > > > restrictedmem_get_page() increases page refcount several versions ago so
> > > > > > no change in v10 is needed. You probably missed my reply:
> > > > > >
> > > > > > https://lore.kernel.org/linux-mm/20221129135844.GA902164@xxxxxxxxxxxxxxxxxx/
> > > >
> > > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > > refcount after setting up mapping in the secondary mmu, otherwise the page will
> > > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> >
> > That's true. Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.
OK. Agreed.
> >
> > > >
> > > > So what we are expecting is: for KVM if the page comes from restricted mem, then
> > > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM should.
> >
> > I argue that this page pinning (or page migration prevention) is not
> > tied to where the page comes from, instead related to how the page will
> > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > it's used by current version of TDX then the page pinning is needed. So
> > such page migration prevention is really TDX thing, even not KVM generic
> > thing (that's why I think we don't need change the existing logic of
> > kvm_release_pfn_clean()).
> >
This essentially boils down to who "owns" page migration handling, and sadly,
page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle page
migration by itself -- it's just a passive receiver.
For normal pages, page migration is totally done by the core-kernel (i.e. it
unmaps page from VMA, allocates a new page, and uses migrate_pape() or a_ops-
>migrate_page() to actually migrate the page).
In the sense of TDX, conceptually it should be done in the same way. The more
important thing is: yes KVM can use get_page() to prevent page migration, but
when KVM wants to support it, KVM cannot just remove get_page(), as the core-
kernel will still just do migrate_page() which won't work for TDX (given
restricted_memfd doesn't have a_ops->migrate_page() implemented).
So I think the restricted_memfd filesystem should own page migration handling,
(i.e. by implementing a_ops->migrate_page() to either just reject page migration
or somehow support it).
To support page migration, it may require KVM's help in case of TDX (the
TDH.MEM.PAGE.RELOCATE SEAMCALL requires "GPA" and "level" of EPT mapping, which
are only available in KVM), but that doesn't make KVM to own the handling of
page migration.
> > Wouldn't better to let TDX code (or who
> > requires that) to increase/decrease the refcount when it populates/drops
> > the secure EPT entries? This is exactly what the current TDX code does:
> >
> > get_page():
> > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217
> >
> > put_page():
> > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334
> >
As explained above, I think doing so in KVM is wrong: it can prevent by using
get_page(), but you cannot simply remove it to support page migration.
Sean also said similar thing when reviewing v8 KVM TDX series and I also agree:
https://lore.kernel.org/lkml/Yvu5PsAndEbWKTHc@xxxxxxxxxx/
https://lore.kernel.org/lkml/31fec1b4438a6d9bb7ff719f96caa8b23ed764d6.camel@xxxxxxxxx/