Re: [PATCH] smp/call: Detect stuck CSD locks

From: Ingo Molnar
Date: Thu Apr 16 2015 - 07:04:39 EST



* Chris J Arges <chris.j.arges@xxxxxxxxxxxxx> wrote:

> Ingo,
>
> Below are the patches and data I've gathered from the reproducer. My
> methodology was as described previously; however I used gdb on the
> qemu process in order to breakpoint L1 once we've detected the hang.
> This made dumping the kvm_lapic structures on L0 more reliable.

Thanks!

So I have trouble interpreting the L1 backtrace, because it shows
something entirely new (to me).

First let's clarify the terminology, to make sure I've got the
workload right:

- L0 is the host kernel, running native Linux. It's not locking up.

- L1 is the guest kernel, running virtualized Linux. This is the one
that is locking up.

- L2 is the nested guest kernel, running whatever test workload you
used - this is obviously locking up together with L1.

Right?

So with that cleared up, the backtrace on L1 looks like this:

> * Crash dump backtrace from L1:
>
> crash> bt -a
> PID: 26 TASK: ffff88013a4f1400 CPU: 0 COMMAND: "ksmd"
> #0 [ffff88013a5039f0] machine_kexec at ffffffff8109d3ec
> #1 [ffff88013a503a50] crash_kexec at ffffffff8114a763
> #2 [ffff88013a503b20] panic at ffffffff818068e0
> #3 [ffff88013a503ba0] csd_lock_wait at ffffffff8113f1e4
> #4 [ffff88013a503bf0] generic_exec_single at ffffffff8113f2d0
> #5 [ffff88013a503c60] smp_call_function_single at ffffffff8113f417
> #6 [ffff88013a503c90] smp_call_function_many at ffffffff8113f7a4
> #7 [ffff88013a503d20] flush_tlb_page at ffffffff810b3bf9
> #8 [ffff88013a503d50] ptep_clear_flush at ffffffff81205e5e
> #9 [ffff88013a503d80] try_to_merge_with_ksm_page at ffffffff8121a445
> #10 [ffff88013a503e00] ksm_scan_thread at ffffffff8121ac0e
> #11 [ffff88013a503ec0] kthread at ffffffff810df0fb
> #12 [ffff88013a503f50] ret_from_fork at ffffffff8180fc98
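
For reference, frame #3 above is the synchronous CSD spin in
kernel/smp.c; in mainline of this era it is essentially the loop
below, and the panic in frame #2 is the stuck-CSD detection from the
patch under test firing once its timeout expired:

  static void csd_lock_wait(struct call_single_data *csd)
  {
          while (csd->flags & CSD_FLAG_LOCK)
                  cpu_relax();
  }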

So this one, VCPU0, is trying to send an IPI to VCPU1, which looks
like this:

> PID: 1674 TASK: ffff8800ba4a9e00 CPU: 1 COMMAND: "qemu-system-x86"
> #0 [ffff88013fd05e20] crash_nmi_callback at ffffffff81091521
> #1 [ffff88013fd05e30] nmi_handle at ffffffff81062560
> #2 [ffff88013fd05ea0] default_do_nmi at ffffffff81062b0a
> #3 [ffff88013fd05ed0] do_nmi at ffffffff81062c88
> #4 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81812241
> [exception RIP: vmx_vcpu_run+992]
> RIP: ffffffff8104cef0 RSP: ffff88013940bcb8 RFLAGS: 00000082
> RAX: 0000000080000202 RBX: ffff880139b30000 RCX: ffff880139b30000
> RDX: 0000000000000200 RSI: ffff880139b30000 RDI: ffff880139b30000
> RBP: ffff88013940bd28 R8: 00007fe192b71110 R9: 00007fe192b71140
> R10: 00007fff66407d00 R11: 00007fe1927f0060 R12: 0000000000000000
> R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
> #8 [ffff88013940be00] kvm_vcpu_ioctl at ffffffff81007894
> #9 [ffff88013940beb0] do_vfs_ioctl at ffffffff81253190
> #10 [ffff88013940bf30] sys_ioctl at ffffffff81253411
> #11 [ffff88013940bf80] system_call_fastpath at ffffffff8180fd4d

So the problem, as far as I can see, is that L1's VCPU1 appears to
be looping with interrupts disabled:

> RIP: ffffffff8104cef0 RSP: ffff88013940bcb8 RFLAGS: 00000082

Look how RFLAGS doesn't have 0x200 set - so it's executing with
interrupts disabled.

That is why the IPI does not get through to it, but kdump's NMI had no
problem getting through.
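
Decoded: 0x82 is SF (0x80) plus the always-set reserved bit 1; IF is
bit 9 (0x200) and it is clear, while NMI delivery ignores IF. A
trivial userspace check, for illustration only:

  #include <stdint.h>
  #include <stdio.h>

  #define X86_EFLAGS_IF 0x200UL   /* interrupt enable flag, bit 9 */

  int main(void)
  {
          uint64_t rflags = 0x82; /* RFLAGS from the VCPU1 dump above */

          printf("IF %s\n", (rflags & X86_EFLAGS_IF) ?
                 "set: interrupts enabled" : "clear: interrupts disabled");
          return 0;
  }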

This (assuming all backtraces are exact!):

> #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad

suggests that we called vmx_vcpu_run() from
vmx_handle_external_intr(), and that we are executing L2 guest code
with interrupts disabled.
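
For context, here is my condensed reading of
vmx_handle_external_intr() from arch/x86/kvm/vmx.c of this era (a
sketch - the IDT dispatch asm is elided):

  static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
  {
          u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);

          if ((exit_intr_info &
               (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK)) ==
              (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR)) {
                  /*
                   * The VM-exit was caused by an external interrupt:
                   * look up the vector's entry in the host IDT and
                   * call the host handler directly, via inline asm
                   * that builds a fake interrupt frame.
                   */
          } else
                  local_irq_enable();
  }

Note that it never calls vmx_vcpu_run(), so if the frame ordering
above is exact, something is off.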

How is this supposed to work? What mechanism does KVM have against
an (untrusted) guest interrupt handler locking up?

I might be misunderstanding how this works at the KVM level, but
from the APIC perspective the situation appears to be pretty clear:
CPU1's interrupts are turned off, so it cannot receive IPIs, and the
CSD wait will eventually time out.
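
That eventual timeout is what the patch under discussion provides; a
sketch of the shape (illustrative only - the bound and the reaction
are arbitrary, not the exact patch):

  static void csd_lock_wait(struct call_single_data *csd)
  {
          unsigned long deadline = jiffies + 10 * HZ;  /* arbitrary */

          while (csd->flags & CSD_FLAG_LOCK) {
                  if (time_after(jiffies, deadline))
                          panic("csd_lock_wait: CSD lock stuck\n");
                  cpu_relax();
          }
  }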

Now obviously it appears to be anomalous (assuming my analysis is
correct) that the interrupt handler has locked up, but that's
immaterial: a nested kernel must not allow its guest to lock it up.

Thanks,

Ingo