Re: [PATCH V3 2/2] LoongArch: KVM: fix "unreliable stack" issue
From: lixianglai
Date: Mon Dec 29 2025 - 05:15:42 EST
Hi Jinyang:
On 2025-12-29 11:53, lixianglai wrote:
Hi Jinyang:
On 2025-12-27 09:27, Xianglai Li wrote:
Insert the appropriate UNWIND macro definition into the
kvm_exc_entry in
the assembly function to guide the generation of correct ORC table
entries,
thereby solving the timeout problem of loading the livepatch-sample
module
on a physical machine running multiple vcpus virtual machines.
While solving the above problems, we have gained an additional
benefit,
that is, we can obtain more call stack information
Stack information that can be obtained before the problem is fixed:
[<0>] kvm_vcpu_block+0x88/0x120 [kvm]
[<0>] kvm_vcpu_halt+0x68/0x580 [kvm]
[<0>] kvm_emu_idle+0xd4/0xf0 [kvm]
[<0>] kvm_handle_gspr+0x7c/0x700 [kvm]
[<0>] kvm_handle_exit+0x160/0x270 [kvm]
[<0>] kvm_exc_entry+0x100/0x1e0
Stack information that can be obtained after the problem is fixed:
[<0>] kvm_vcpu_block+0x88/0x120 [kvm]
[<0>] kvm_vcpu_halt+0x68/0x580 [kvm]
[<0>] kvm_emu_idle+0xd4/0xf0 [kvm]
[<0>] kvm_handle_gspr+0x7c/0x700 [kvm]
[<0>] kvm_handle_exit+0x160/0x270 [kvm]
[<0>] kvm_exc_entry+0x104/0x1e4
[<0>] kvm_enter_guest+0x38/0x11c
[<0>] kvm_arch_vcpu_ioctl_run+0x26c/0x498 [kvm]
[<0>] kvm_vcpu_ioctl+0x200/0xcf8 [kvm]
[<0>] sys_ioctl+0x498/0xf00
[<0>] do_syscall+0x98/0x1d0
[<0>] handle_syscall+0xb8/0x158
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Xianglai Li <lixianglai@xxxxxxxxxxx>
---
Cc: Huacai Chen <chenhuacai@xxxxxxxxxx>
Cc: WANG Xuerui <kernel@xxxxxxxxxx>
Cc: Tianrui Zhao <zhaotianrui@xxxxxxxxxxx>
Cc: Bibo Mao <maobibo@xxxxxxxxxxx>
Cc: Charlie Jenkins <charlie@xxxxxxxxxxxx>
Cc: Xianglai Li <lixianglai@xxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Tiezhu Yang <yangtiezhu@xxxxxxxxxxx>
arch/loongarch/kvm/switch.S | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/arch/loongarch/kvm/switch.S b/arch/loongarch/kvm/switch.S
index 93845ce53651..a3ea9567dbe5 100644
--- a/arch/loongarch/kvm/switch.S
+++ b/arch/loongarch/kvm/switch.S
@@ -10,6 +10,7 @@
#include <asm/loongarch.h>
#include <asm/regdef.h>
#include <asm/unwind_hints.h>
+#include <linux/kvm_types.h>
#define HGPR_OFFSET(x) (PT_R0 + 8*x)
#define GGPR_OFFSET(x) (KVM_ARCH_GGPR + 8*x)
@@ -110,9 +111,9 @@
* need to copy world switch code to DMW area.
*/
.text
+ .p2align PAGE_SHIFT
.cfi_sections .debug_frame
SYM_CODE_START(kvm_exc_entry)
- .p2align PAGE_SHIFT
UNWIND_HINT_UNDEFINED
csrwr a2, KVM_TEMP_KS
csrrd a2, KVM_VCPU_KS
@@ -170,6 +171,7 @@ SYM_CODE_START(kvm_exc_entry)
/* restore per cpu register */
ld.d u0, a2, KVM_ARCH_HPERCPU
addi.d sp, sp, -PT_SIZE
+ UNWIND_HINT_REGS
/* Prepare handle exception */
or a0, s0, zero
@@ -200,7 +202,7 @@ ret_to_host:
jr ra
SYM_CODE_END(kvm_exc_entry)
-EXPORT_SYMBOL(kvm_exc_entry)
+EXPORT_SYMBOL_FOR_KVM(kvm_exc_entry)
/*
* int kvm_enter_guest(struct kvm_run *run, struct kvm_vcpu *vcpu)
@@ -215,6 +217,14 @@ SYM_FUNC_START(kvm_enter_guest)
/* Save host GPRs */
kvm_save_host_gpr a2
+ /*
+ * The csr_era member variable of the pt_regs structure is
required
+ * for unwinding orc to perform stack traceback, so we need to
put
+ * pc into csr_era member variable here.
+ */
+ pcaddi t0, 0
+ st.d t0, a2, PT_ERA
Hi, Xianglai,
It should use `SYM_CODE_START` to mark the `kvm_enter_guest` rather
than
`SYM_FUNC_START`, since the `SYM_FUNC_START` is used to mark "C-likely"
asm functionw.
Ok, I will use SYM_CODE_START to mark kvm_enter_guest in the next
version.
I guess the kvm_enter_guest is something like exception
handler becuase the last instruction is "ertn". So usually it should
mark UNWIND_HINT_REGS where can find last frame info by "$sp".
However, all info is store to "$a2", this mark should be
`UNWIND_HINT sp_reg=ORC_REG_A2(???) type=UNWIND_HINT_TYPE_REGS`.
I don't konw why save this function internal PC here by `pcaddi t0, 0`,
and I think it is no meaning(, for exception handler, they save last PC
by read CSR.ERA). The `kvm_enter_guest` saves registers by
"$a2"("$sp" - PT_REGS) beyond stack ("$sp"), it is dangerous if IE
is enable. So I wonder if there is really a stacktrace through this
function?
The stack backtracking issue in switch.S is rather complex because it
involves the switching between cpu root-mode and guest-mode:
Real stack backtracking should be divided into two parts:
part 1:
[<0>] kvm_enter_guest+0x38/0x11c
[<0>] kvm_arch_vcpu_ioctl_run+0x26c/0x498 [kvm]
[<0>] kvm_vcpu_ioctl+0x200/0xcf8 [kvm]
[<0>] sys_ioctl+0x498/0xf00
[<0>] do_syscall+0x98/0x1d0
[<0>] handle_syscall+0xb8/0x158
part 2:
[<0>] kvm_vcpu_block+0x88/0x120 [kvm]
[<0>] kvm_vcpu_halt+0x68/0x580 [kvm]
[<0>] kvm_emu_idle+0xd4/0xf0 [kvm]
[<0>] kvm_handle_gspr+0x7c/0x700 [kvm]
[<0>] kvm_handle_exit+0x160/0x270 [kvm]
[<0>] kvm_exc_entry+0x104/0x1e4
In "part 1", after executing kvm_enter_guest, the cpu switches from
root-mode to guest-mode.
In this case, stack backtracking is indeed very rare.
In "part 2", the cpu switches from the guest-mode to the root-mode,
and most of the stack backtracking occurs during this phase.
To obtain the longest call chain, we save pc in kvm_enter_guest to
pt_regs.csr_era,
and after restoring the sp of the root-mode cpu in kvm_exc_entry,
The ORC entry was re-established using "UNWIND_HINT_REGS",
and then we obtained the following stack backtrace as we wanted:
[<0>] kvm_vcpu_block+0x88/0x120 [kvm]
[<0>] kvm_vcpu_halt+0x68/0x580 [kvm]
[<0>] kvm_emu_idle+0xd4/0xf0 [kvm]
[<0>] kvm_handle_gspr+0x7c/0x700 [kvm]
[<0>] kvm_handle_exit+0x160/0x270 [kvm]
[<0>] kvm_exc_entry+0x104/0x1e4
I found this might be a coincidence—correct behavior due to the incorrect
UNWIND_HINT_REGS mark and unusual stack adjustment.
First, the kvm_enter_guest contains only a single branch instruction,
ertn.
It hardware-jump to the CSR.ERA address directly, jump into
kvm_exc_entry.
At this point, the stack layout looks like this:
-------------------------------
frame from call to `kvm_enter_guest`
------------------------------- <- $sp
PT_REGS
------------------------------- <- $a2
Then kvm_exc_entry adjust stack without save any register (e.g. $ra, $sp)
but still marked UNWIND_HINT_REGS.
After the adjustment:
-------------------------------
frame from call to `kvm_enter_guest`
-------------------------------
PT_REGS
------------------------------- <- $a2, new $sp
During unwinding, when the unwinder reaches kvm_exc_entry,
it meets the mark of PT_REGS and correctly recovers
pc = regs.csr_era, sp = regs.sp, ra = regs.ra
Yes, here unwinder does work as you say.
a) Can we avoid "ertn" rather than `jr reg (or jirl ra, reg, 0)`
instead, like call?
No, we need to rely on the 'ertn instruction return PIE to CRMD IE,
at the same time to ensure that its atomic,
there should be no other instruction than' ertn 'more appropriate here.
The kvm_exc_entry cannot back to kvm_enter_guest
if we use "ertn", so should the kvm_enter_guest appear on the stacktrace?
It is flexible. As I mentioned above, the cpu completes the switch from
host-mode to guest mode through kvm_enter_guest,
and then the switch from guest mode to host-mode through kvm_exc_entry.
When we ignore the details of the host-mode
and guest-mode switching in the middle, we can understand that the host
cpu has completed kvm_enter_guest->kvm_exc_entry.
From this perspective, I think it can exist in the call stack, and at
the same time, we have obtained the maximum call stack information.
b) Can we adjust $sp before entering kvm_exc_entry? Then we can mark
UNWIND_HINT_REGS at the beginning of kvm_exc_entry, which something
like ret_from_kernel_thread_asm.
The following command can be used to dump the orc entries of the kernel:
./tools/objtool/objtool --dump vmlinux
You can observe that not all orc entries are generated at the beginning
of the function.
For example:
handle_tlb_protect
ftrace_stub
handle_reserved
So, is it unnecessary for us to modify UNWIND_HINT_REGS in order to
place it at the beginning of the function.
If you have a better solution, could you provide an example of the
modification?
I can test the feasibility of the solution.
Thanks!
Xianglai.
[<0>] kvm_enter_guest+0x38/0x11c
[<0>] kvm_arch_vcpu_ioctl_run+0x26c/0x498 [kvm]
[<0>] kvm_vcpu_ioctl+0x200/0xcf8 [kvm]
[<0>] sys_ioctl+0x498/0xf00
[<0>] do_syscall+0x98/0x1d0
[<0>] handle_syscall+0xb8/0x158
Doing so is equivalent to ignoring the details of the cpu root-mode
and guest-mode switching.
About what you said in the IE enable phase is dangerous,
interrupts are always off during the cpu root-mode and guest-mode
switching in kvm_enter_guest and kvm_exc_entry.