Re: [RFC][PATCH] arm64: tlb: call kvm_call_hyp once during kvm_tlb_flush_vmid_range
From: yezhenyu (A)
Date: Thu Feb 12 2026 - 07:02:55 EST
Thanks for your review.
On 2026/2/9 22:35, Marc Zyngier wrote:
On Mon, 09 Feb 2026 13:14:07 +0000,
"yezhenyu (A)" <yezhenyu2@xxxxxxxxxx> wrote:
From 9982be89f55bd99b3683337223284f0011ed248e Mon Sep 17 00:00:00 2001
From: eillon <yezhenyu2@xxxxxxxxxx>
Date: Mon, 9 Feb 2026 19:48:46 +0800
Subject: [RFC][PATCH v1] arm64: tlb: call kvm_call_hyp once during
kvm_tlb_flush_vmid_range
The kvm_tlb_flush_vmid_range() function is performance-critical
during live migration, but it loops when the system supports TLB
flush by range and the size is larger than MAX_TLBI_RANGE_PAGES.
This results in frequent entry to kvm_call_hyp() and then a large
What is the cost of kvm_call_hyp()?
Most of the cost of kvm_tlb_flush_vmid_range() is in
__tlb_switch_to_host(), which is called on every
__kvm_tlb_flush_vmid()/__kvm_tlb_flush_vmid_range() invocation.
amount of time is spent in kvm_clear_dirty_log_protect() during
migration (more than 50%).
50% of what time? The guest's run-time? The time spent doing TLBIs
compared to the time spent in kvm_clear_dirty_log_protect()?
kvm_clear_dirty_log_protect() costs more than 50% of the time spent
in ram_find_and_save_block(), though not every time.
I captured a flame graph during the live migration, and the
distribution of several key functions is as follows (sorry, I
cannot share the SVG files outside my company):
ram_find_and_save_block(): 84.01%
  memory_region_clear_dirty_bitmap(): 33.40%
    kvm_clear_dirty_log_protect(): 26.74%
      kvm_arch_flush_remote_tlbs_range(): 9.67%
        __tlb_switch_to_host(): 9.51%
    kvm_arch_mmu_enable_log_dirty_pt_masked(): 9.38%
  ram_save_target_page_legacy(): 43.41%
memory_region_clear_dirty_bitmap() accounts for about 40% of
ram_find_and_save_block(), and kvm_arch_flush_remote_tlbs_range()
accounts for about 29% of memory_region_clear_dirty_bitmap().
After applying the patch, the distribution of the same functions is
as follows:
ram_find_and_save_block(): 53.84%
  memory_region_clear_dirty_bitmap(): 2.28%
    kvm_clear_dirty_log_protect(): 1.75%
      kvm_arch_flush_remote_tlbs_range(): 0.03%
        __tlb_switch_to_host(): 0.03%
    kvm_arch_mmu_enable_log_dirty_pt_masked(): 0.96%
  ram_save_target_page_legacy(): 38.97%
With the patch, memory_region_clear_dirty_bitmap() accounts for about
4% of ram_find_and_save_block(), and kvm_arch_flush_remote_tlbs_range()
accounts for about 1% of memory_region_clear_dirty_bitmap().
So, when the address range is larger than
MAX_TLBI_RANGE_PAGES, directly call __kvm_tlb_flush_vmid to
optimize performance.
Multiple things here:
- there is no SoB, which means that patch cannot be considered for
merging
If there are no other issues with this patch, I can resend it with the
SoB (Signed-off-by) tag.
- there is no data showing how this change improves the situation for
a large enough set of workloads
- there is no description of a test that could be run on multiple
implementations to check whether this change has a positive or
negative impact
This patch affects the migration bandwidth during live migration.
With the same physical bandwidth, the optimization effect of this
patch can be observed by monitoring the actual live-migration
bandwidth.
I have tested this in an RDMA-like environment where the physical
bandwidth is about 100 GBps: without this patch, the migration
bandwidth is below 10 GBps; with this patch applied, it can reach
50 GBps.
If you want to progress this sort of things, you will need to address
these points.
Thanks,
M.