Re: [PATCH v3] kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd

From: Alexander Graf
Date: Fri Apr 21 2017 - 20:28:59 EST




On 04.04.17 12:35, Suzuki K Poulose wrote:
Hi Christoffer,

On 04/04/17 11:13, Christoffer Dall wrote:
Hi Suzuki,

On Mon, Apr 03, 2017 at 03:12:43PM +0100, Suzuki K Poulose wrote:
In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
unmap_stage2_range() on the entire memory range for the guest. This
could
cause problems with other callers (e.g, munmap on a memslot) trying to
unmap a range. And since we have to unmap the entire Guest memory range
holding a spinlock, make sure we yield the lock if necessary, after we
unmap each PUD range.

Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
Cc: stable@xxxxxxxxxxxxxxx # v3.10+
Cc: Paolo Bonzini <pbonzin@xxxxxxxxxx>
Cc: Marc Zyngier <marc.zyngier@xxxxxxx>
Cc: Christoffer Dall <christoffer.dall@xxxxxxxxxx>
Cc: Mark Rutland <mark.rutland@xxxxxxx>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@xxxxxxx>
[ Avoid vCPU starvation and lockup detector warnings ]
Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@xxxxxxx>


This unfortunately fails to build on 32-bit ARM, and I also think we
intended to check against S2_PGDIR_SIZE, not S2_PUD_SIZE.

Sorry about that, I didn't test the patch with arm32. I am fine the
patch below. And I agree that the name change does make things more
readable. See below for a hunk that I posted to the kbuild report.


How about adding this to your patch (which includes a rename of
S2_PGD_SIZE which is horribly confusing as it indicates the size of the
first level stage-2 table itself, where S2_PGDIR_SIZE indicates the size
of address space mapped by a single entry in the same table):

diff --git a/arch/arm/include/asm/stage2_pgtable.h
b/arch/arm/include/asm/stage2_pgtable.h
index 460d616..c997f2d 100644
--- a/arch/arm/include/asm/stage2_pgtable.h
+++ b/arch/arm/include/asm/stage2_pgtable.h
@@ -35,10 +35,13 @@

#define stage2_pud_huge(pud) pud_huge(pud)

+#define S2_PGDIR_SIZE PGDIR_SIZE
+#define S2_PGDIR_MASK PGDIR_MASK
+
/* Open coded p*d_addr_end that can deal with 64bit addresses */
static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr,
phys_addr_t end)
{
- phys_addr_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
+ phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;

return (boundary - 1 < end - 1) ? boundary : end;
}
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index db94f3a..6e79a4c 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -41,7 +41,7 @@ static unsigned long hyp_idmap_start;
static unsigned long hyp_idmap_end;
static phys_addr_t hyp_idmap_vector;

-#define S2_PGD_SIZE (PTRS_PER_S2_PGD * sizeof(pgd_t))
+#define S2_PGD_TABLE_SIZE (PTRS_PER_S2_PGD * sizeof(pgd_t))
#define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))

#define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0)
@@ -299,7 +299,7 @@ static void unmap_stage2_range(struct kvm *kvm,
phys_addr_t start, u64 size)
* If the range is too large, release the kvm->mmu_lock
* to prevent starvation and lockup detector warnings.
*/
- if (size > S2_PUD_SIZE)
+ if (size > S2_PGDIR_SIZE)
cond_resched_lock(&kvm->mmu_lock);
next = stage2_pgd_addr_end(addr, end);
if (!stage2_pgd_none(*pgd))
@@ -747,7 +747,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
}

/* Allocate the HW PGD, making sure that each page gets its own
refcount */
- pgd = alloc_pages_exact(S2_PGD_SIZE, GFP_KERNEL | __GFP_ZERO);
+ pgd = alloc_pages_exact(S2_PGD_TABLE_SIZE, GFP_KERNEL | __GFP_ZERO);
if (!pgd)
return -ENOMEM;

@@ -843,7 +843,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
spin_unlock(&kvm->mmu_lock);

/* Free the HW pgd, one page at a time */
- free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
+ free_pages_exact(kvm->arch.pgd, S2_PGD_TABLE_SIZE);
kvm->arch.pgd = NULL;
}


Btw, I have a different hunk to solve the problem, posted to the kbuild
report. I will post it here for the sake of capturing the discussion in
one place. The following hunk on top of the patch, changes the lock
release after we process one PGDIR entry. As for the first time
we enter the loop we haven't done much with the lock held, hence it may
make
sense to do it after the first round and we have more work to do.

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index db94f3a..582a972 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -295,15 +295,15 @@ static void unmap_stage2_range(struct kvm *kvm,
phys_addr_t start, u64 size)
assert_spin_locked(&kvm->mmu_lock);
pgd = kvm->arch.pgd + stage2_pgd_index(addr);
do {
+ next = stage2_pgd_addr_end(addr, end);
+ if (!stage2_pgd_none(*pgd))

Just as heads up, I had this version applied to my tree by accident (commit 8b3405e345b5a098101b0c31b264c812bba045d9 from Christoffer's queue) and ran into a NULL pointer dereference:

[223090.242280] Unable to handle kernel NULL pointer dereference at virtual address 00000040
[223090.262330] PC is at unmap_stage2_range+0x8c/0x428
[223090.262332] LR is at kvm_unmap_hva_handler+0x2c/0x3c
[223090.262531] Call trace:
[223090.262533] [<ffff0000080adb78>] unmap_stage2_range+0x8c/0x428
[223090.262535] [<ffff0000080adf40>] kvm_unmap_hva_handler+0x2c/0x3c
[223090.262537] [<ffff0000080ace2c>] handle_hva_to_gpa+0xb0/0x104
[223090.262539] [<ffff0000080af988>] kvm_unmap_hva+0x5c/0xbc
[223090.262543] [<ffff0000080a2478>] kvm_mmu_notifier_invalidate_page+0x50/0x8c
[223090.262547] [<ffff0000082274f8>] __mmu_notifier_invalidate_page+0x5c/0x84
[223090.262551] [<ffff00000820b700>] try_to_unmap_one+0x1d0/0x4a0
[223090.262553] [<ffff00000820c5c8>] rmap_walk+0x1cc/0x2e0
[223090.262555] [<ffff00000820c90c>] try_to_unmap+0x74/0xa4
[223090.262557] [<ffff000008230ce4>] migrate_pages+0x31c/0x5ac
[223090.262561] [<ffff0000081f869c>] compact_zone+0x3fc/0x7ac
[223090.262563] [<ffff0000081f8ae0>] compact_zone_order+0x94/0xb0
[223090.262564] [<ffff0000081f91c0>] try_to_compact_pages+0x108/0x290
[223090.262569] [<ffff0000081d5108>] __alloc_pages_direct_compact+0x70/0x1ac
[223090.262571] [<ffff0000081d64a0>] __alloc_pages_nodemask+0x434/0x9f4
[223090.262572] [<ffff0000082256f0>] alloc_pages_vma+0x230/0x254
[223090.262574] [<ffff000008235e5c>] do_huge_pmd_anonymous_page+0x114/0x538
[223090.262576] [<ffff000008201bec>] handle_mm_fault+0xd40/0x17a4
[223090.262577] [<ffff0000081fb324>] __get_user_pages+0x12c/0x36c
[223090.262578] [<ffff0000081fb804>] get_user_pages_unlocked+0xa4/0x1b8
[223090.262579] [<ffff0000080a3ce8>] __gfn_to_pfn_memslot+0x280/0x31c
[223090.262580] [<ffff0000080a3dd0>] gfn_to_pfn_prot+0x4c/0x5c
[223090.262582] [<ffff0000080af3f8>] kvm_handle_guest_abort+0x240/0x774
[223090.262584] [<ffff0000080b2bac>] handle_exit+0x11c/0x1ac
[223090.262586] [<ffff0000080ab99c>] kvm_arch_vcpu_ioctl_run+0x31c/0x648
[223090.262587] [<ffff0000080a1d78>] kvm_vcpu_ioctl+0x378/0x768
[223090.262590] [<ffff00000825df5c>] do_vfs_ioctl+0x324/0x5a4
[223090.262591] [<ffff00000825e26c>] SyS_ioctl+0x90/0xa4
[223090.262595] [<ffff000008085d84>] el0_svc_naked+0x38/0x3c

0xffff0000080adb78 is in unmap_stage2_range (../arch/arm/kvm/mmu.c:260).
255 pud_t *pud, *start_pud;
256
257 start_pud = pud = stage2_pud_offset(pgd, addr);
258 do {
259 next = stage2_pud_addr_end(addr, end);
260 if (!stage2_pud_none(*pud)) {
261 if (stage2_pud_huge(*pud)) {
262 pud_t old_pud = *pud;
263
264 stage2_pud_clear(pud);


So please beware that for some reason pud may become invalid after rescheduling.


Alex