Re: [PATCH v2] arm64: tlb: Fix TLBI RANGE operand

From: Gavin Shan
Date: Thu Apr 04 2024 - 06:26:39 EST


Hi Marc,

On 4/4/24 19:10, Marc Zyngier wrote:
On Thu, 04 Apr 2024 06:36:24 +0100,
Gavin Shan <gshan@xxxxxxxxxx> wrote:

KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
bitmap is collected by VMM and the corresponding PTEs need to be
write-protected during live migration. Unfortunately, the operand
passed to the TLBI RANGE instruction isn't correctly sorted out by
commit d1d3aa98b1d4 ("arm64: tlb: Use the TLBI RANGE feature in arm64").

This isn't the offending commit. See below.


Yes, I agree with you and more explanations as below.

It leads to crash on the destination VM after live migration because
TLBs aren't flushed completely and some of the dirty pages are missed.

For example, I have a VM where 8GB memory is assigned, starting from
0x40000000 (1GB). Note that the host has 4KB as the base page size.
All TLBs for VM can be covered by one TLBI RANGE operation. However,
the operand 0xffff708000040000 is set for scale -9, and -1 is returned

It's not scale, as it is limited to 2 bits. It's a random value that
actively corrupts adjacent fields because it is wrongly sign-extended.
ASID and TG are now utter bollocks, and the CPU is within its rights
to totally ignore the TLBI (TG indicates 64kB translation granule...).

We really should fix __TLBI_VADDR_RANGE() to use proper bit fields
instead of a bunch of shifts that lead to this mess.


Well, I meant the variable @scale instead of SCALE, part of the operand
to TLBI RANGE instruction. Let me make it clear in next revision.

Yes, __TLBI_VADDR_RANGE() can be improved by masks. However, the source
of the mess is 117940aa6e5f8 ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")
because it can pass @pages (e.g. 0x200000) that __flush_tlb_range_op() can't
handle. It may be another hint to set 117940aa6e5f8 as the fix target.

from __TLBI_RANGE_NUM() for scale 3/2/1/0 and rejected by the loop in
__flush_tlb_range_op(). __TLBI_RANGE_NUM() isn't expected to work
like this because all the pages should be covered by scale 3/2/1/0,
plus an additional page if needed.

Fix the macro __TLBI_RANGE_NUM() so that the correct NUM and TLBI RANGE
operand are provided for each scale level. With the changes, [-1 31]
instead of [-1 30] can be returned from the macro, meaning the TLBs for
0x200000 pages (8GB memory) can be flushed in one shoot at scale 3. The
macro TLBI_RANGE_MASK is dropped since no one uses it any more.

Fixes: d1d3aa98b1d4 ("arm64: tlb: Use the TLBI RANGE feature in arm64")
Cc: stable@xxxxxxxxxx # v5.10+

I don't think this is right. The problem only occurs since
117940aa6e5f8 ("KVM: arm64: Define kvm_tlb_flush_vmid_range()"), which
is the only case where we try to use NUM=31 (the rest of the kernel is
using (MAX_TLBI_RANGE_PAGES - 1), which results in NUM=30 at most).

Also, before e2768b798a19 ("arm64/mm: Modify range-based tlbi to
decrement scale"), we used a different algorithm to perform the
invalidation (increasing scale instead of decreasing), so this
probably doesn't hit the same way.

In any case, this is a KVM-only problem that will never show before
v6.6. So 5.10 really isn't a place where we need to backport anything.


Agreed. It's more precise to set 117940aa6e5f8 as the fix target. I will
correct it in v3.

Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")

Reported-by: Yihuang Yu <yihyu@xxxxxxxxxx>
Suggested-by: Marc Zyngier <maz@xxxxxxxxxx>
Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
---
v2: Improve __TLBI_RANGE_NUM() as Marc suggested
---
arch/arm64/include/asm/tlbflush.h | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 3b0e8248e1a4..cd9b71c30366 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -161,12 +161,17 @@ static inline unsigned long get_trans_granule(void)
#define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
/*
- * Generate 'num' values from -1 to 30 with -1 rejected by the
+ * Generate 'num' values from -1 to 31 with -1 rejected by the
* __flush_tlb_range() loop below.
*/
-#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
-#define __TLBI_RANGE_NUM(pages, scale) \
- ((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
+#define __TLBI_RANGE_NUM(pages, scale) \
+ ({ \
+ int __pages = min((pages), \
+ __TLBI_RANGE_PAGES(31, (scale))); \
+ int __numplus1 = __pages >> (5 * (scale) + 1); \
+ \
+ (__numplus1 - 1); \
+ })

This was only intended as a way to convey the general idea. __numplus1
can obviously be removed and the right-shifting expression promoted as
the return value.

Next, the comments in this file need adjustments to reflect the
supported invalidation range, as my original patch did (plus some
more).

Finally, and since we can now handle the full range of invalidation,
it would make sense to fix __flush_tlb_range_nosync() to allow
MAX_TLBI_RANGE_PAGES ranges (potentially in a separate patch).

In the end, my sandbox contains the following, which should probably
be split in 3 patches:


Ok, I thought @__numplus1 was there for better code readability since it's
likely opted out by GCC's optimization. I will drop @__numplus1 anyway. And
the comments need to be improved for sure.

Yeah, I've noticed that __flush_tlb_range_nosync() needs to allow
MAX_TLBI_RANGE_PAGES to do range based TLB flush.

In summary, we need 3 patches but the one fixing __TLBI_RANGE_NUM needs to be
PATCH[1/3] so that it can be easily picked by stable kernel. PATCH[2/3] would
be to improve __TLBI_VADDR_RANGE with masks. PATCH[3/3] will allow __flush_tlb_range_nosync()
to do range-based TLB flush for MAX_TLBI_RANGE_PAGES.

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 3b0e8248e1a4..bcbe697ed191 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -142,16 +142,22 @@ static inline unsigned long get_trans_granule(void)
* EL1, Inner Shareable".
*
*/
+#define TLBIR_ASID_MASK GENMASK_ULL(63, 48)
+#define TLBIR_TG_MASK GENMASK_ULL(47, 46)
+#define TLBIR_SCALE_MASK GENMASK_ULL(45, 44)
+#define TLBIR_NUM_MASK GENMASK_ULL(43, 39)
+#define TLBIR_TTL_MASK GENMASK_ULL(38, 37)
+#define TLBIR_BADDR_MASK GENMASK_ULL(36, 0)
+
#define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \
({ \
- unsigned long __ta = (baddr); \
+ unsigned long __ta = FIELD_PREP(TLBIR_BADDR_MASK, (baddr)); \
unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
- __ta &= GENMASK_ULL(36, 0); \
- __ta |= __ttl << 37; \
- __ta |= (unsigned long)(num) << 39; \
- __ta |= (unsigned long)(scale) << 44; \
- __ta |= get_trans_granule() << 46; \
- __ta |= (unsigned long)(asid) << 48; \
+ __ta |= FIELD_PREP(TLBIR_TTL_MASK, __ttl); \
+ __ta |= FIELD_PREP(TLBIR_NUM_MASK, (unsigned long)(num)); \
+ __ta |= FIELD_PREP(TLBIR_SCALE_MASK, (unsigned long)(scale)); \
+ __ta |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule()); \
+ __ta |= FIELD_PREP(TLBIR_ASID_MASK, (unsigned long)(asid)); \
__ta; \
})
@@ -161,12 +167,17 @@ static inline unsigned long get_trans_granule(void)
#define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
/*
- * Generate 'num' values from -1 to 30 with -1 rejected by the
- * __flush_tlb_range() loop below.
+ * Generate 'num' values from -1 to 31 with -1 rejected by the
+ * __flush_tlb_range() loop below. Its return value is only
+ * significant for a maximum of MAX_TLBI_RANGE_PAGES pages. If 'pages'
+ * is more than that, you must iterate over the overall range.
*/
-#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
-#define __TLBI_RANGE_NUM(pages, scale) \
- ((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
+#define __TLBI_RANGE_NUM(pages, scale) \
+ ({ \
+ int __pages = min((pages), \
+ __TLBI_RANGE_PAGES(31, (scale))); \
+ (__pages >> (5 * (scale) + 1)) - 1; \
+ })
/*
* TLB Invalidation
@@ -379,10 +390,6 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
* 3. If there is 1 page remaining, flush it through non-range operations. Range
* operations can only span an even number of pages. We save this for last to
* ensure 64KB start alignment is maintained for the LPA2 case.
- *
- * Note that certain ranges can be represented by either num = 31 and
- * scale or num = 0 and scale + 1. The loop below favours the latter
- * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
*/
#define __flush_tlb_range_op(op, start, pages, stride, \
asid, tlb_level, tlbi_user, lpa2) \
@@ -437,11 +444,11 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
* When not uses TLB range ops, we can handle up to
* (MAX_DVM_OPS - 1) pages;
* When uses TLB range ops, we can handle up to
- * (MAX_TLBI_RANGE_PAGES - 1) pages.
+ * MAX_TLBI_RANGE_PAGES pages.
*/
if ((!system_supports_tlb_range() &&
(end - start) >= (MAX_DVM_OPS * stride)) ||
- pages >= MAX_TLBI_RANGE_PAGES) {
+ pages > MAX_TLBI_RANGE_PAGES) {
flush_tlb_mm(vma->vm_mm);
return;
}


Thanks,
Gavin