[PATCH] KVM: arm64: Limit stage2_apply_range() batch size to smallest block
From: Krister Johansen
Date: Thu Mar 28 2024 - 15:22:14 EST
stage2_apply_range() for unmap operations can interfere with the
performance of IO if the device's interrupts share the CPU where the
unmap operation is occurring. commit 5994bc9e05c2 ("KVM: arm64: Limit
stage2_apply_range() batch size to largest block") improved this. Prior
to that commit, workloads that were unfortunate enough to have their IO
interrupts pinned to the same CPU as the unmap operation would observe a
complete stall. With the switch to using the largest block size, it is
possible for IO to make progress, albeit at a reduced speed.
In this author's tests of network and storage workloads with interrupts
pinned to the same CPU as the unmap operation, throughput was reduced by
roughly 4.75-5.8x for networking and 65.5-500x for storage.
The use case where this has been especially painful is hardware-virtualized
containers. Many containers have a short lifetime and may run on systems
where the host is intentionally oversubscribed, which limits the options
for pinning and prefaulting. Although NIC interrupts allow their CPU
affinity to be altered, some NVMe devices do not permit it. Some cloud
block storage devices expose only a few queues, so unlucky placement can
have a large performance impact.
Further reducing the stage2_apply_range() batch size yields substantial
performance improvements for IO that shares a CPU with an unmap operation.
After switching to a 2MB chunk, no IO performance regressions were
observed in this author's tests: it was possible to obtain the advertised
device throughput despite an unmap operation running on the CPU servicing
the interrupt. There is a tradeoff, however. Per-operation timings were
unchanged when running kvm_pagetable_test without an interrupt load, but
with a 64GB VM, 1 vcpu, 4K pages, and an IO load, map times increased by
about 15% and unmap times by about 58%. In essence, this trades slower
map/unmap times for improved IO throughput.
This patch introduces KVM_PGTABLE_MAX_BLOCK_LEVEL and uses it to limit
stage2_apply_range() chunks to the smallest size that is addressable via
a block mapping: 2MB with a 4K granule.
Cc: <stable@xxxxxxxxxxxxxxx> # 5.15.x: 3b5c082bbfa2: KVM: arm64: Work out supported block level at compile time
Cc: <stable@xxxxxxxxxxxxxxx> # 5.15.x: 5994bc9e05c2: KVM: arm64: Limit stage2_apply_range() batch size to largest block
Cc: <stable@xxxxxxxxxxxxxxx> # 5.15.x
Suggested-by: Ali Saidi <alisaidi@xxxxxxxxxx>
Signed-off-by: Krister Johansen <kjlx@xxxxxxxxxxxxxxxxxx>
---
arch/arm64/include/asm/kvm_pgtable.h | 4 ++++
arch/arm64/kvm/mmu.c | 2 +-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 19278dfe7978..b0c4651a4d9a 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -19,11 +19,15 @@
* - 4K (level 1): 1GB
* - 16K (level 2): 32MB
* - 64K (level 2): 512MB
+ *
+ * The max block level is the _smallest_ supported block size for KVM.
*/
#ifdef CONFIG_ARM64_4K_PAGES
#define KVM_PGTABLE_MIN_BLOCK_LEVEL 1
+#define KVM_PGTABLE_MAX_BLOCK_LEVEL 2
#else
#define KVM_PGTABLE_MIN_BLOCK_LEVEL 2
+#define KVM_PGTABLE_MAX_BLOCK_LEVEL KVM_PGTABLE_MIN_BLOCK_LEVEL
#endif
#define kvm_lpa2_is_enabled() system_supports_lpa2()
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dc04bc767865..1e927b306aee 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -41,7 +41,7 @@ static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
{
- phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
+ phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MAX_BLOCK_LEVEL);
return __stage2_range_addr_end(addr, end, size);
}
--
2.25.1