Re: Temporary KVM guest hangs connected to KSM and NUMA balancer
From: Friedrich Weber
Date: Thu Jan 11 2024 - 07:43:40 EST
Hi,
On 04/01/2024 14:42, Friedrich Weber wrote:
> f47e5bbb ("KVM: x86/mmu: Zap only TDP MMU leafs in zap range and
> mmu_notifier unmap")
As this commit mentions tdp_mmu_zap_leafs, I ran the reproducer again on
610a9b8f (6.7-rc8), this time with a bpftrace script [1] that reports
calls to that function (with a kernel stack trace) and to task_numa_work
whenever they take longer than 500ms.
Again, there are several invocations of task_numa_work that take >500ms
(the VM appears to be frozen during that time). For instance, here is
one invocation that took ~20 seconds:
[1704971602] task_numa_work (tid=52035) took 19995 ms
For this particular thread, in the 20 seconds leading up to that, there
were 8 invocations of tdp_mmu_zap_leafs that each took >500ms:
[1704971584] tdp_mmu_zap_leafs (tid=52035) took 2291 ms
[1704971586] tdp_mmu_zap_leafs (tid=52035) took 2343 ms
[1704971589] tdp_mmu_zap_leafs (tid=52035) took 2316 ms
[1704971590] tdp_mmu_zap_leafs (tid=52035) took 1663 ms
[1704971591] tdp_mmu_zap_leafs (tid=52035) took 682 ms
[1704971594] tdp_mmu_zap_leafs (tid=52035) took 2706 ms
[1704971597] tdp_mmu_zap_leafs (tid=52035) took 3132 ms
[1704971602] tdp_mmu_zap_leafs (tid=52035) took 4846 ms
These add up to roughly 20s (19979 ms in total). The kernel stack trace
is the same for all of them:
bpf_prog_5ca52691cb9e9fbd_tdp_mmu_zap_lea+345
bpf_prog_5ca52691cb9e9fbd_tdp_mmu_zap_lea+345
bpf_trampoline_380104735946+208
tdp_mmu_zap_leafs+5
kvm_unmap_gfn_range+347
kvm_mmu_notifier_invalidate_range_start+394
__mmu_notifier_invalidate_range_start+156
change_protection+3908
change_prot_numa+105
task_numa_work+1029
bpf_trampoline_6442457341+117
task_numa_work+9
xfer_to_guest_mode_handle_work+261
kvm_arch_vcpu_ioctl_run+1553
kvm_vcpu_ioctl+667
__x64_sys_ioctl+164
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
AFAICT this pattern repeats several times. I uploaded the last 150 KB
of the bpftrace output to [2]; I can provide the full output if needed.
To me (not knowing much about the KVM/KSM/NUMA balancer interplay) this
looks as if task_numa_work triggers several invocations of
tdp_mmu_zap_leafs, each of which takes an unusually long time. If anyone
has a hunch why this might happen, or an idea where to look next, that
would be much appreciated.
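One way to gather more data might be to also record how wide a GFN range
each zap covers, and how many zaps a single task_numa_work invocation
triggers. A rough, untested sketch (kvm_unmap_gfn_range and its struct
kvm_gfn_range argument are taken from the stack trace above; this
assumes bpftrace can resolve the kvm module's types via BTF):

// Untested sketch: width (in pages) of each GFN range zapped via
// kvm_unmap_gfn_range (range->start/end are GFNs)
kfunc:kvm_unmap_gfn_range {
	@zap_pages = hist(args->range->end - args->range->start);
}
// Number of tdp_mmu_zap_leafs calls per task_numa_work invocation
kfunc:task_numa_work { @in_numa_work[tid] = 1; }
kfunc:tdp_mmu_zap_leafs /@in_numa_work[tid]/ { @num_zaps[tid]++; }
kretfunc:task_numa_work /@in_numa_work[tid]/ {
	printf("task_numa_work (tid=%d): %d zaps\n", tid, @num_zaps[tid]);
	delete(@in_numa_work[tid]);
	delete(@num_zaps[tid]);
}

That should at least show whether each zap covers a very large range, or
whether there are simply very many of them.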
Best,
Friedrich
[1]
// Report tdp_mmu_zap_leafs calls that take >500ms, with kernel stack
kfunc:tdp_mmu_zap_leafs { @start_zap[tid] = nsecs; }
kretfunc:tdp_mmu_zap_leafs /@start_zap[tid]/ {
	$diff = nsecs - @start_zap[tid];
	if ($diff > 500000000) { // 500ms
		time("[%s] ");
		printf("tdp_mmu_zap_leafs (tid=%d) took %d ms\n", tid, $diff / 1000000);
		print(kstack());
	}
	delete(@start_zap[tid]);
}
// Same reporting for task_numa_work, without a stack trace
kfunc:task_numa_work { @start_numa[tid] = nsecs; }
kretfunc:task_numa_work /@start_numa[tid]/ {
	$diff = nsecs - @start_numa[tid];
	if ($diff > 500000000) { // 500ms
		time("[%s] ");
		printf("task_numa_work (tid=%d) took %d ms\n", tid, $diff / 1000000);
	}
	delete(@start_numa[tid]);
}
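(For reference: a script like this can be run directly with
"bpftrace <file>"; the kfunc/kretfunc probes require a kernel built
with CONFIG_DEBUG_INFO_BTF.)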
[2] https://paste.debian.net/1303767/