Re: [PATCH] KVM: arm/arm64: fix unaligned hva start and end in handle_hva_to_gpa

From: Jia He
Date: Mon May 14 2018 - 22:03:47 EST


Hi Suzuki

I will merge the other thread into this one and add the necessary Cc list.

That WARN_ON call trace is very easy to reproduce on my armv8a server once I start 20 guests

and run memhog in the host. Of course, KSM has to be enabled.

For your question about my fault-injection debug patch:

diff --git a/mm/ksm.c b/mm/ksm.c
index e3cbf9a..876bec8 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -43,6 +43,8 @@
 #include <asm/tlbflush.h>
 #include "internal.h"

+int trigger_by_ksm = 0;
+
 #ifdef CONFIG_NUMA
 #define NUMA(x)		(x)
 #define DO_NUMA(x)	do { (x); } while (0)
@@ -2587,11 +2589,14 @@ void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc)
 			if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
 				continue;

+			trigger_by_ksm = 1;
 			if (!rwc->rmap_one(page, vma,
 					rmap_item->address, rwc->arg)) {
 				anon_vma_unlock_read(anon_vma);
+				trigger_by_ksm = 0;
 				return;
 			}
+			trigger_by_ksm = 0;
 			if (rwc->done && rwc->done(page)) {
 				anon_vma_unlock_read(anon_vma);
 				return;
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 7f6a944..ab8545e 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -290,12 +290,17 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
  * destroying the VM), otherwise another faulting VCPU may come in and mess
  * with things behind our backs.
  */
+extern int trigger_by_ksm;
 static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
 {
 	pgd_t *pgd;
 	phys_addr_t addr = start, end = start + size;
 	phys_addr_t next;

+	if (trigger_by_ksm) {
+		end -= 0x200;
+	}
+
 	assert_spin_locked(&kvm->mmu_lock);
 	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
 	do {

I need to point out that I never reproduced it without this debugging patch.

Please also see my comments below

On 5/14/2018 6:06 PM, Suzuki K Poulose wrote:
On 14/05/18 03:30, Jia He wrote:

On 5/11/2018 9:39 PM, Suzuki K Poulose wrote:
Marc

Thanks for looping me in. Comments below.


On 03/05/18 03:02, Jia He wrote:
Hi Marc

Thanks for the review


On 5/2/2018 10:26 PM, Marc Zyngier wrote:
[+ Suzuki]

On 02/05/18 08:08, Jia He wrote:
From: Jia He <jia.he@xxxxxxxxxxxxxxxx>

In our armv8a server (QDF2400), I noticed a WARN_ON as follows:

[  800.202850] WARNING: CPU: 33 PID: 255 at arch/arm64/kvm/../../../virt/kvm/arm/mmu.c:1670 kvm_age_hva_handler+0xcc/0xd4
Which kernel version is that? I don't have a WARN_ON() at this line in
4.17. Do you have a reproducer?
My running kernel version is v4.14-15, but I can reproduce it in 4.17 (start 20 guests and run memhog in the host).
In 4.17, the WARN_ON is at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/virt/kvm/arm/mmu.c#n1826
[  800.213535] Modules linked in: vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm vfat fat iw_cm mlx5_ib ib_core dm_mirror dm_region_hash dm_log dm_mod crc32_ce ipmi_ssif sg nfsd
[  800.284115]  auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx5_core ixgbe mlxfw devlink mdio ahci_platform libahci_platform qcom_emac libahci hdma hdma_mgmt i2c_qup
[  800.300382] CPU: 33 PID: 255 Comm: khugepaged Tainted: G        W       4.14.36+ #6
[  800.308030] Hardware name: <snip for confidential issues>
Well, that's QDF2400, right? ;-)
yes, exactly :)
[  800.318717] task: ffff8017c949c000 task.stack: ffff8017c9498000
[  800.324629] PC is at kvm_age_hva_handler+0xcc/0xd4
[  800.329412] LR is at handle_hva_to_gpa+0xec/0x15c
[  800.334109] pc : [<ffff0000080b4f2c>] lr : [<ffff0000080b4838>] pstate: 20400145
[  800.341496] sp : ffff8017c949b260
[  800.344804] x29: ffff8017c949b260 x28: ffff801663e25008
[  800.350110] x27: 0000000000020000 x26: 00000001fb1a0000
[  800.355416] x25: 0000ffff605b0200 x24: 0000ffff605a0200
[  800.360722] x23: 0000000000000000 x22: 000000000000ffff
[  800.366028] x21: 00000001fb1a0000 x20: ffff8017c085a000
[  800.371334] x19: ffff801663e20008 x18: 0000000000000000
[  800.376641] x17: 0000000000000000 x16: 0000000000000000
[  800.381947] x15: 0000000000000000 x14: 3d646e655f617668
[  800.387254] x13: 2c30303230623530 x12: 36666666663d646e
[  800.392560] x11: 652c303032306135 x10: 3036666666663d74
[  800.397867] x9 : 0000000000003796 x8 : 655f6e66672c3030
[  800.403173] x7 : ffff00000859434c x6 : ffff8017f9c30cb8
[  800.408479] x5 : ffff8017f9c30cb8 x4 : ffff0000080b4e60
[  800.413786] x3 : 0000000000000000 x2 : 0000000000020000
[  800.419092] x1 : 00000001fb1a0000 x0 : 0000000020000000
[  800.424398] Call trace:
[  800.426838] Exception stack(0xffff8017c949b120 to 0xffff8017c949b260)
[  800.433272] b120: 0000000020000000 00000001fb1a0000 0000000000020000 0000000000000000
[  800.441095] b140: ffff0000080b4e60 ffff8017f9c30cb8 ffff8017f9c30cb8 ffff00000859434c
[  800.448918] b160: 655f6e66672c3030 0000000000003796 3036666666663d74 652c303032306135
[  800.456740] b180: 36666666663d646e 2c30303230623530 3d646e655f617668 0000000000000000
[  800.464563] b1a0: 0000000000000000 0000000000000000 0000000000000000 ffff801663e20008
[  800.472385] b1c0: ffff8017c085a000 00000001fb1a0000 000000000000ffff 0000000000000000
[  800.480208] b1e0: 0000ffff605a0200 0000ffff605b0200 00000001fb1a0000 0000000000020000
[  800.488030] b200: ffff801663e25008 ffff8017c949b260 ffff0000080b4838 ffff8017c949b260
[  800.495853] b220: ffff0000080b4f2c 0000000020400145 0000000000000001 ffff8017c949b2a0
[  800.503676] b240: ffffffffffffffff ffff8017c949b260 ffff8017c949b260 ffff0000080b4f2c
[  800.511498] [<ffff0000080b4f2c>] kvm_age_hva_handler+0xcc/0xd4
[  800.517324] [<ffff0000080b4838>] handle_hva_to_gpa+0xec/0x15c
[  800.523063] [<ffff0000080b6c5c>] kvm_age_hva+0x5c/0xcc
[  800.528194] [<ffff0000080a7c3c>] kvm_mmu_notifier_clear_flush_young+0x54/0x90
[  800.535324] [<ffff00000827a0e8>] __mmu_notifier_clear_flush_young+0x6c/0xa8
[  800.542279] [<ffff00000825a644>] page_referenced_one+0x1e0/0x1fc
[  800.548279] [<ffff00000827e8f8>] rmap_walk_ksm+0x124/0x1a0
[  800.553759] [<ffff00000825c974>] rmap_walk+0x94/0x98
[  800.558717] [<ffff00000825ca98>] page_referenced+0x120/0x180
[  800.564369] [<ffff000008228c58>] shrink_active_list+0x218/0x4a4
[  800.570281] [<ffff000008229470>] shrink_node_memcg+0x58c/0x6fc
[  800.576107] [<ffff0000082296c4>] shrink_node+0xe4/0x328
[  800.581325] [<ffff000008229c9c>] do_try_to_free_pages+0xe4/0x3b8
[  800.587324] [<ffff00000822a094>] try_to_free_pages+0x124/0x234
[  800.593150] [<ffff000008216aa0>] __alloc_pages_nodemask+0x564/0xf7c
[  800.599412] [<ffff000008292814>] khugepaged_alloc_page+0x38/0xb8
[  800.605411] [<ffff0000082933bc>] collapse_huge_page+0x74/0xd70
[  800.611238] [<ffff00000829470c>] khugepaged_scan_mm_slot+0x654/0xa98
[  800.617585] [<ffff000008294e0c>] khugepaged+0x2bc/0x49c
[  800.622803] [<ffff0000080ffb70>] kthread+0x124/0x150
[  800.627762] [<ffff0000080849f0>] ret_from_fork+0x10/0x1c
[  800.633066] ---[ end trace 944c130b5252fb01 ]---
-------------------------------------------------------------------------

The root cause might be: we can't guarantee that the start and end parameters
of handle_hva_to_gpa() are PAGE_SIZE aligned, let alone hva_start and hva_end.
So why not align them in the first place?
You mean at the start of handle_hva_to_gpa()?
But a boundary check is still needed in each iteration of kvm_for_each_memslot. Am I missing anything here?
This bug was introduced by commit 056aad67f836 ("kvm: arm/arm64: Rework gpa
callback handlers").

This patch fixes the bug by using the size converted back from gfns.

Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers")

Signed-off-by: jia.he@xxxxxxxxxxxxxxxx
Signed-off-by: li.zhang@xxxxxxxxxxxxxxxx
---
 virt/kvm/arm/mmu.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 7f6a944..9dd7ae4 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1744,7 +1744,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
 	/* we only care about the pages that the guest sees */
 	kvm_for_each_memslot(memslot, slots) {
 		unsigned long hva_start, hva_end;
-		gfn_t gpa;
+		gpa_t gpa, gpa_end;

 		hva_start = max(start, memslot->userspace_addr);
 		hva_end = min(end, memslot->userspace_addr +
@@ -1753,7 +1753,9 @@ static int handle_hva_to_gpa(struct kvm *kvm,
 			continue;

 		gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
-		ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
+		gpa_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot)
+						<< PAGE_SHIFT;
+		ret |= handler(kvm, gpa, (u64)(gpa_end - gpa), data);
But we're looking for the mapping in the same memslot, so the distance
between hva and hva_end is the same as the one between gpa and gpa_end
if you didn't align it.
Maybe not; sometimes hva_end - hva_start != gpa_end - gpa:
start=fffdc37f0200,hva_start=fffdc37f0200,end=fffdc3800200,hva_end=fffdc3800000,gpa=3ff0000,gfn_end=4000000

but sometimes it is:
start=ffff60590200,hva_start=ffff60590200,end=ffff605a0200,hva_end=ffff605a0200,gpa=1fb190000,gfn_end=1fb1b0000

IMO, the misalignment is caused by the ksm stable-tree flag STABLE_FLAG. I will
propose another ksm patch to fix it.
But from handle_hva_to_gpa()'s point of view, arm KVM still needs to avoid the
follow-up exception, just like what powerpc and x86 have done.

As far as I can see, this is triggered by someone (in this case page_referenced_one(), via ksm?)
issuing a clear_flush_young for a page with a non-aligned page address.

If you look at the code path, __mmu_notifier_clear_flush_young() is invoked
via two code paths with the "given" address:

ptep_clear_flush_young_notify(), in which case end is set to address + PAGE_SIZE
pmdp_clear_flush_young_notify(), in which case end is set to address + PMD_SIZE

We were supposed to only clear_flush_young for *the page* containing the
address (start), but we do a clear_flush_young for the next page as well,
which (I think) is not intended. So to me it looks like either
page_referenced_one() or its caller must align the address to PAGE_SIZE
or PMD_SIZE, depending on what it really wants to do, to avoid touching
the adjacent entries (page or block mappings).

Suzuki
Jia He,

Suzuki, thanks for the comments.

I proposed another ksm patch https://lkml.org/lkml/2018/5/3/1042
The root cause is that ksm adds some extra flag bits to the address to indicate
whether the page is in the stable tree. This makes the address not aligned to PAGE_SIZE.
Thanks for the pointer. In the future, please Cc the people relevant to the
discussion in the patches.

From the arm KVM MMU point of view, do you think handle_hva_to_gpa() still needs to handle
the unaligned case?
I don't think we should do that. Had we done this, we would never have caught this bug
in KSM. Eventually, if some new notifier consumer comes along which doesn't check
alignment and doesn't WARN, it could simply do the wrong thing. So I believe what we
have is a good measure to make sure that things are in the right order.

IMO, the PAGE_SIZE alignment is still needed, because we should not make the low-level
handler kvm_age_hva_handler() deal with the exception. Please refer to the implementations
of kvm_handle_hva_range() in x86 and powerpc: they both align the hva with hva_to_gfn_memslot().

From an API perspective, you are passed a "start" and an "end" address. So you could potentially
do the wrong thing if you align "start" and "end". Maybe those handlers should also do the
same thing as we do.
But handle_hva_to_gpa() has possibly already partially adjusted the alignment:
  1750		kvm_for_each_memslot(memslot, slots) {
  1751			unsigned long hva_start, hva_end;
  1752			gfn_t gpa;
  1753
  1754			hva_start = max(start, memslot->userspace_addr);
  1755			hva_end = min(end, memslot->userspace_addr +
  1756					(memslot->npages << PAGE_SHIFT));

At line 1755, let us assume that end=0x12340200 and
memslot->userspace_addr + (memslot->npages << PAGE_SHIFT) = 0x12340000.
Then hva_start is not PAGE_SIZE aligned while hva_end is, and the size will be PAGE_SIZE - 0x200,
which is exactly the situation my fault-injection debug patch emulates.

--
Cheers,
Jia