Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

From: Christian Borntraeger
Date: Fri Jun 25 2021 - 03:45:40 EST




On 24.06.21 14:57, Nicholas Piggin wrote:
Excerpts from Paolo Bonzini's message of June 24, 2021 10:41 pm:
On 24/06/21 13:42, Nicholas Piggin wrote:
Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
follow_pte in gfn_to_pfn. However, the resolved pfns may not have
assoicated struct pages, so they should not be passed to pfn_to_page.
This series removes such calls from the x86 and arm64 secondary MMU. To
do this, this series modifies gfn_to_pfn to return a struct page in
addition to a pfn, if the hva was resolved by gup. This allows the
caller to call put_page only when necessated by gup.

This series provides a helper function that unwraps the new return type
of gfn_to_pfn to provide behavior identical to the old behavior. As I
have no hardware to test powerpc/mips changes, the function is used
there for minimally invasive changes. Additionally, as gfn_to_page and
gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
easily changed over to only use pfns.

This addresses CVE-2021-22543 on x86 and arm64.

Does this fix the problem? (untested I don't have a POC setup at hand,
but at least in concept)

This one actually compiles at least. Unfortunately I don't have much
time in the near future to test, and I only just found out about this
CVE a few hours ago.

And it also works (the reproducer gets an infinite stream of userspace
exits and especially does not crash). We can still go for David's
solution later since MMU notifiers are able to deal with this pages, but
it's a very nice patch for stable kernels.

Oh nice, thanks for testing. How's this?

Thanks,
Nick

---

KVM: Fix page ref underflow for regions with valid but non-refcounted pages

It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
of APIs, which take a reference to the page, which takes it from 0 to 1.
When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

Signed-off-by: Nicholas Piggin <npiggin@xxxxxxxxx>
---
virt/kvm/kvm_main.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
return true;
}
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+ if (kvm_is_reserved_pfn(pfn))
+ return 1;
+ return get_page_unless_zero(pfn_to_page(pfn));
+}
+
static int hva_to_pfn_remapped(struct vm_area_struct *vma,
unsigned long addr, bool *async,
bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* Whoever called remap_pfn_range is also going to call e.g.
* unmap_mapping_range before the underlying pages are freed,
* causing a call to our MMU notifier.
+ *
+ * Certain IO or PFNMAP mappings can be backed with valid
+ * struct pages, but be allocated without refcounting e.g.,
+ * tail pages of non-compound higher order allocations, which
+ * would then underflow the refcount when the caller does the
+ * required put_page. Don't allow those pages here.
*/
- kvm_get_pfn(pfn);
+ if (!kvm_try_get_pfn(pfn))
+ r = -EFAULT;

Right. That should also take care of s390 (pin_guest_page in vsie.c
which calls gfn_to_page).
FWIW, the current API is really hard to follow as it does not tell
which functions take a reference and which dont.

Anyway, this patch (with cc stable?)

Reviewed-by: Christian Borntraeger <borntraeger@xxxxxxxxxx>

out:
pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
- return 0;
+
+ return r;
}
/*