Re: [PATCH v6 1/4] KVM: mmu: introduce new gfn_to_pfn_noref functions

From: Sean Christopherson
Date: Wed May 24 2023 - 14:29:59 EST

On Wed, May 24, 2023, Peter Xu wrote:
> On Wed, May 24, 2023 at 09:46:13AM -0700, Sean Christopherson wrote:
> > If we hack kvm_pfn_to_refcounted_page(), then all of those protections are lost
> > because KVM would drop its assertions and also skip dirtying pages, i.e. would
> > effectively suppress the latent detection by check_new_page_bad().
> So it's probably that I have no idea what the attributes of
> those special pages are, so I don't understand well enough why we need to handle
> those pages differently from, e.g., PFNMAP pages, nor what the benefits are.
> What I can tell is that they're pages that don't have
> PageCompound bits set on either the head or the tails, yet they are still a
> higher-order (multi-page) allocation. Is there an example of how these pages are used
> and allocated? Why would we need those pages, and do these pages need
> to be marked dirty/accessed at all?

The use case David is interested in is one where an AMD GPU driver kmalloc()s a
chunk of memory, lets userspace mmap() it, and userspace then maps it
into the guest for a virtual (passthrough?) GPU. For all intents and purposes,
it's normal memory; it just isn't refcounted.
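To make the "!Compound but still higher-order" point concrete, here is a small userspace model, not kernel code. `struct sim_page` and the `sim_*` helpers are hypothetical simplifications, and the model assumes the page-allocator behavior Peter describes: a higher-order allocation made without __GFP_COMP gives only the first page a reference, whereas with __GFP_COMP, page_count() on a tail page resolves to the head page's count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct page. */
struct sim_page {
	int refcount;               /* models page->_refcount */
	bool head;                  /* models PageHead() */
	struct sim_page *head_page; /* models compound_head() for tails, else NULL */
};

#define SIM_MAX_PAGES 16

/*
 * Models alloc_pages(gfp, order): only the first page gets a reference.
 * With comp == true (think __GFP_COMP), tails are linked to the head.
 */
static void sim_alloc_pages(struct sim_page *pages, unsigned int order, bool comp)
{
	size_t n = (size_t)1 << order;

	assert(n <= SIM_MAX_PAGES);
	for (size_t i = 0; i < n; i++) {
		pages[i].refcount = (i == 0) ? 1 : 0;
		pages[i].head = comp && i == 0;
		pages[i].head_page = (comp && i > 0) ? &pages[0] : NULL;
	}
}

/* Models page_count(): a compound tail reports its head's refcount. */
static int sim_page_count(const struct sim_page *page)
{
	return page->head_page ? page->head_page->refcount : page->refcount;
}
```

For an order-2 allocation, pages 1-3 report a count of 0 in the non-compound case but 1 in the compound case, which is why a refcount==0 check fires on the former.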

> > static bool kvm_is_ad_tracked_page(struct page *page)
> > {
> > + /*
> > + * Assert that KVM isn't attempting to mark a freed page as Accessed or
> > + * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
> > + * (typically) doesn't pin pages that are mapped in KVM's MMU, and
> > + * instead relies on mmu_notifiers to know when a mapping needs to be
> > + * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
> > + * KVM returns from its mmu_notifier, i.e. the page should have an
> > + * elevated refcount at this point even though KVM doesn't hold a
> > + * reference of its own.
> > + */
> > + if (WARN_ON_ONCE(!page_count(page)))
> > + return false;
> > +
> > /*
> > * Per page-flags.h, pages tagged PG_reserved "should in general not be
> > * touched (e.g. set dirty) except by its owner".
> >
> This looks like a good thing to have, indeed. But again, it doesn't seem
> specific to the pages we're discussing here, i.e., the !Compound &&
> refcount==0 ones.

The problem is that if KVM ignores refcount==0 pages, then KVM can't distinguish
between the legitimate[*] refcount==0 AMD GPU case and a buggy refcount==0
use-after-free scenario. I don't want to make that sacrifice, as the legitimate
!refcounted use case is a very specific one, whereas consuming refcounted
memory is ubiquitous (outside of maybe AWS).
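A rough userspace model of that trade-off: with the strict check, modeled on the quoted patch, both a freed page and a tail page of the GPU allocation trip the warning. With a lenient variant that quietly skips refcount==0 pages, neither does, so a real use-after-free goes unreported. The `sim_*` names are hypothetical, and unlike WARN_ON_ONCE() the model counts every hit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified page: only what the A/D check looks at. */
struct sim_page {
	int refcount;  /* models page_count(); 0 for a freed page and also for a
	                * tail of a non-compound higher-order allocation */
	bool reserved; /* models PageReserved() */
};

static int sim_warnings; /* counts every hit, unlike WARN_ON_ONCE() */

/* Strict variant, modeled on the quoted patch: refcount==0 is a likely
 * use-after-free, so complain and don't touch Accessed/Dirty. */
static bool sim_is_ad_tracked_page(const struct sim_page *page)
{
	assert(page != NULL);
	if (page->refcount == 0) {
		sim_warnings++;
		fprintf(stderr, "WARN: A/D update on a page with refcount 0\n");
		return false;
	}
	return !page->reserved;
}

/* Lenient variant: silently skip refcount==0 pages. A freed page and a
 * legitimate !refcounted page now look identical and nothing is reported. */
static bool sim_is_ad_tracked_page_lenient(const struct sim_page *page)
{
	assert(page != NULL);
	if (page->refcount == 0)
		return false;
	return !page->reserved;
}
```

Both variants return false for the two refcount==0 pages; the only difference is whether anyone finds out, which is exactly the detection that would be lost.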

[*] Consuming !refcounted pages is safe only for flows that are tied into the
mmu_notifiers. The current proposal/plan is to add an off-by-default module
param that lets userspace opt in to kmap() use of !refcounted memory, e.g.
this case and PFNMAP memory.
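A minimal sketch of the opt-in gate described in the footnote, assuming a boolean module param and a per-flow "tied into mmu_notifiers" property. The param name `sim_allow_unrefcounted` and the `sim_pfn_kind` enum are hypothetical; the actual name and semantics were not settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the proposed off-by-default module param. */
static bool sim_allow_unrefcounted = false;

enum sim_pfn_kind {
	SIM_PFN_REFCOUNTED,   /* ordinary refcounted struct page */
	SIM_PFN_UNREFCOUNTED, /* struct page with refcount 0, e.g. the GPU case */
	SIM_PFN_PFNMAP,       /* memory from a VM_PFNMAP mapping */
};

/*
 * Refcounted memory is always usable. Everything else requires both the
 * userspace opt-in and a flow that is tied into the mmu_notifiers, since
 * only such flows learn when the mapping must be zapped.
 */
static bool sim_may_consume_pfn(enum sim_pfn_kind kind, bool uses_mmu_notifier)
{
	assert(kind <= SIM_PFN_PFNMAP);
	if (kind == SIM_PFN_REFCOUNTED)
		return true;
	return sim_allow_unrefcounted && uses_mmu_notifier;
}
```

The design choice being modeled is that the exception is both opt-in and limited to notifier-protected flows, so the common refcounted case is unaffected.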