Re: [PATCH v9 3/6] KVM: mmu: Improve handling of non-refcounted pfns

From: Sean Christopherson
Date: Mon Feb 05 2024 - 21:54:32 EST

Next message: Ricardo Neri: "Re: [PATCH 6/9] thermal: intel: hfi: Enable Intel Thread Director"
Previous message: Jacky Huang: "[PATCH v4 4/4] pinctrl: nuvoton: Add ma35d1 pinctrl and GPIO driver"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Oct 03, 2023, Maxim Levitsky wrote:
> У пн, 2023-09-11 у 11:16 +0900, David Stevens пише:
> > From: David Stevens <stevensd@xxxxxxxxxxxx>
> > The fact that non-refcounted pfns can no longer be accessed without mmu
> > notifier protection is a breaking change. Since there is no timeline for
> > updating everything in KVM to use mmu notifiers, this change adds an
> > opt-in module parameter called allow_unsafe_mappings to allow such
> > mappings. Systems which trust userspace not to tear down such unsafe
> > mappings while KVM is using them can set this parameter to re-enable the
> > legacy behavior.
>
> Do you have a practical example of a VM that can break with this change?
> E.g will a normal VM break? will a VM with VFIO devices break? Will a VM with
> hugepages mapped into it break?
>
> Will the trick of limiting the kernel memory with 'mem=X', and then use the
> extra 'upper memory' for VMs still work?

This is the trick that will require an opt-in from the admin. Anything where KVM
is effectively relying on userspace to pinky swear that the memory won't be
migrated, freed, etc.

It's unlikely, but theoretically possible that it might break existing setups for
"normal" VMs. E.g. if a VM is using VM_PFNMAP'd memory for a nested VMCS. But
such setups are already wildly broken, their users just don't know it. The proposal
here is to require admins for such setups to opt-in to the "unsafe" behavior,
i.e. give backwards compatibility, but make the admin explicitly acknowledge that
what they are doing may have unwanted consequences.

> > + /*
> > + * True if the returned pfn is for a page with a valid refcount. False
> > + * if the returned pfn has no struct page or if the struct page is not
> > + * being refcounted (e.g. tail pages of non-compound higher order
> > + * allocations from IO/PFNMAP mappings).
> >
> Aren't all tail pages not-refcounted (e.g tail page of a hugepage?)
> I haven't researched this topic yet.

Nope. As Christoph stated, they are most definitely "weird" allocations though.
In this case, IIRC, it's the DRM's Translation Table Manager (TTM) code that
kmalloc's a large chunk of memory, and then stuffs the pfn into the page tables
courtesy of the vmf_insert_pfn_prot() in ttm_bo_vm_fault_reserved().

The head page has a non-zero refcount, but it's not really refcounted. And the
tail pages have nothing, which IIRC, results in KVM inadvertantly causing pages
to be freed due to putting the last refeferences.

[*] https://lore.kernel.org/all/ZRZeaP7W5SuereMX@xxxxxxxxxxxxx

> > + *
> > + * When this output flag is false, callers should not try to convert
> > + * the pfn to a struct page.

This should explain what the flag tracks, not what callers should or shouldn't
do with the flag. E.g. strictly speaking, there's no danger in grabbing the
corresponding "struct page" if the caller verifies it's a valid PFN. Trying to
use the page outside of mmu_notifier protection is where things would get dicey.

> > + */
> > + bool is_refcounted_page;
> > };
> >
> > kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 9b33a59c6d65..235c5cb3fdac 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -96,6 +96,10 @@ unsigned int halt_poll_ns_shrink;
> > module_param(halt_poll_ns_shrink, uint, 0644);
> > EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
> >
> > +/* Allow non-struct page memory to be mapped without MMU notifier protection. */
> > +static bool allow_unsafe_mappings;
> > +module_param(allow_unsafe_mappings, bool, 0444);
> > +
> > /*
> > * Ordering of locks:
> > *
> > @@ -2507,6 +2511,15 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > return rc == -EHWPOISON;
> > }
> >
> > +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> > + struct page *page)
> > +{
> > + kvm_pfn_t pfn = page_to_pfn(page);
> > +
> > + foll->is_refcounted_page = true;
> > + return pfn;
> > +}
>
> Just a matter of taste but to me this function looks confusing.
> IMHO, just duplicating these two lines of code is better.
> However if you prefer I won't argue over this.

Hrm, when I suggested this, there was also a put_page() and a comment about hacky
code in there:

static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
struct page *page)
{
kvm_pfn_t pfn = page_to_pfn(page);

foll->is_refcounted_page = true;

/*
* FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
* doesn't want to grab a reference, but gup() doesn't support getting
* just the pfn, i.e. FOLL_GET is effectively mandatory. If that ever
* changes, drop this and simply don't pass FOLL_GET to gup().
*/
if (!(foll->flags & FOLL_GET))
put_page(page);

return pfn;
}

More below.

> > - /*
> > - * Get a reference here because callers of *hva_to_pfn* and
> > - * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
> > - * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
> > - * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
> > - * simply do nothing for reserved pfns.
> > - *
> > - * Whoever called remap_pfn_range is also going to call e.g.
> > - * unmap_mapping_range before the underlying pages are freed,
> > - * causing a call to our MMU notifier.
> > - *
> > - * Certain IO or PFNMAP mappings can be backed with valid
> > - * struct pages, but be allocated without refcounting e.g.,
> > - * tail pages of non-compound higher order allocations, which
> > - * would then underflow the refcount when the caller does the
> > - * required put_page. Don't allow those pages here.
> > - */
>
> Why the comment is removed? as far as I see the code still grabs a reference
> to the page.

Because what the above comment is saying is effectively obsoleted by is_refcounted_page.
The old comment is essentially saying, "grab a reference because KVM expects to
put a reference", whereas as the new code grabs a reference because it's necessary
to honor allow_unsafe_mappings.

So I agree with David that the existing comment should go away, but I agree with
you in that kvm_follow_refcounted_pfn() really needs a comment, e.g. to explain
how KVM manages struct page refcounts.

> > - if (!kvm_try_get_pfn(pfn))
> > - r = -EFAULT;
> > + if (get_page_unless_zero(page))
> > + WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
>
> Once again, the kvm_follow_refcounted_pfn usage is confusing IMHO. It sets
> the 'foll->is_refcounted_page', and yet someone can think that it's only
> there for the WARN_ON_ONCE.
>
> That IMHO would read better:
>
> if (get_page_unless_zero(page))
> foll->is_refcounted_page = true;
>
> WARN_ON_ONCE(page_to_pfn(page) != pfn);
>
> Note that I moved the warn out of the 'get_page_unless_zero' condition
> because I think that this condition should be true for non refcounted pages
> as well.

Yeah, let me see if I can piece together what happened to the put_page() call.

> Also I don't understand why 'get_page_unless_zero(page)' result is ignored.
> As I understand it, it will increase refcount of a page unless it is zero
>
> If a refcount of a refcounted page is 0 isn't that a bug?

Yes, the problem is that KVM doesn't know if a page is refcounted or not, without
actually trying to acquire a reference. See the TTM mess mentioned above.

Note, Christoph is suggesting that KVM instead refuse to play nice and force the
TTM code to properly refcount things. I very much like that idea in theory, but
I have no idea how feasible it is (that code is all kinds of twisty).

> The page was returned from kvm_pfn_to_refcounted_page which supposed only to
> return pages that are refcounted.
>
> I might not understand something in regard to 'get_page_unless_zero(page)'
> usage both in old and the new code.

Next message: Ricardo Neri: "Re: [PATCH 6/9] thermal: intel: hfi: Enable Intel Thread Director"
Previous message: Jacky Huang: "[PATCH v4 4/4] pinctrl: nuvoton: Add ma35d1 pinctrl and GPIO driver"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]