Re: [PATCH] smp/call: Detect stuck CSD locks

From: Chris J Arges
Date: Wed Apr 29 2015 - 17:09:20 EST


On Thu, Apr 16, 2015 at 06:31:40PM +0200, Ingo Molnar wrote:

<snip>

> Now what would be nice is to observe it whether the CPU that is not
> doing the CSD wait is truly locked up.
>
> It might be executing random KVM-ish workloads and the various
> backtraces we've seen so far are just a random sample of those
> workloads (from L1's perspective).
>
> Yet the fact that the kdump's NMI gets through is a strong indication
> that the CPU's APIC is fine: NMIs are essentially IPIs too, they just
> go to the NMI vector, which punches through irqs-off regions.
>
> So maybe another debug trick would be useful: instead of re-sending
> the IPI, send a single non-destructive NMI every second or so,
> creating a backtrace on the other CPU. From that we'll be able to see
> whether it's locked up permanently in an irqs-off section.
>
> I.e. basically you could try to trigger the 'show NMI backtraces on
> all CPUs' logic when the lockup triggers, and repeat it every couple
> of seconds.
>
> The simplest method to do that would be to call:
>
> trigger_all_cpu_backtrace();
>
> every couple of seconds, in the CSD polling loop - after the initial
> timeout has passed. I'd suggest to collect at least 10 pairs of
> backtraces that way.
>
> Thanks,
>
> Ingo
>

Ingo,

Ok, finally got some good data and some analysis for this issue.

I've uploaded my data here:
http://people.canonical.com/~arges/csd_hang/

Patch I used for these experiments is here:
http://people.canonical.com/~arges/csd_hang/csd-lock-debug.patch

The actual dmesg/bt output are in the timestamped directory.

One thing I tried to do was observe how long we could wait before getting into
an actual softlockup that led to a full system lockup. The maximum time I saw a
kernel thread spend waiting on the csd lock without it turning into a full
system hang was 0.017s. By detecting the hang earlier I was able to use things
like ftrace to actually see the IPI call that hangs the machine in action.

* Experiment 1:

For every detected lockup, generate NMIs on all CPUs using
'trigger_all_cpu_backtrace()'. I used a csd_lock_timeout value of 1s.
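For reference, the debug path amounts to roughly the following (a minimal
sketch of the idea only, not the actual csd-lock-debug.patch linked above;
'csd_lock_timeout_ns' is just a placeholder name for the configurable timeout):

static void csd_lock_wait(struct call_single_data *csd)
{
        u64 start = sched_clock();
        bool warned = false;

        /* Spin until the target CPU runs the function and unlocks the csd. */
        while (csd->flags & CSD_FLAG_LOCK) {
                if (!warned && sched_clock() - start > csd_lock_timeout_ns) {
                        warned = true;
                        pr_err("csd_lock_wait: csd lock held for %llu ns\n",
                               (unsigned long long)(sched_clock() - start));
                        /* NMI-backtrace every CPU, as suggested above. */
                        trigger_all_cpu_backtrace();
                }
                cpu_relax();
        }
}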

This was perhaps inconclusive. With the backtrace call added at the point where
we detect a hang, I never got the watchdog to detect a softlockup. Furthermore,
the machine became unresponsive while running this experiment. Setting a
breakpoint with gdb/qemu showed that the machine was still hitting the csd
timeout code over and over again; from there I was able to 'call panic()' and
get a crashdump.

In 201504291924 and 201504292008, it looks like the backtrace itself causes
csd_lock_wait to detect a hang and then backtrace again.

* Experiment 2:

A bit more interesting was using ftrace and dumping its output when a lockup
was detected in csd_lock_wait.

The csd_lock_timeout value was set to 0.018s (just above the 0.017s maximum
observed for a wait that did not hang the machine), and ftrace was dumped on
oops. Using this method I collected 10 traces, most of which had usable ftrace
output.
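The timeout arm in this variant amounts to something like the following (again
only a sketch, not the exact patch; calling ftrace_dump(DUMP_ALL) directly from
the timeout path would be the other obvious way to get the same data out):

                if (sched_clock() - start > csd_lock_timeout_ns) {
                        /* With /proc/sys/kernel/ftrace_dump_on_oops set, this
                         * panic spills the function_graph buffers of all CPUs
                         * into dmesg, which is where the traces below come
                         * from. */
                        panic("csd_lock_wait: csd lock stuck for %llu ns\n",
                              (unsigned long long)(sched_clock() - start));
                }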


In 201504282127 we see a normal flush_tlb_page call, followed by the interrupt
handler on the other CPU calling the requested function 'flush_tlb_func':

[ 603.248016] 2452.083739 | 0) | ptep_clear_flush() {
[ 603.248016] 2452.083739 | 0) | flush_tlb_page() {
[ 603.248016] 2452.083739 | 0) 0.071 us | leave_mm();
[ 603.248016] 2452.083739 | 0) | native_flush_tlb_others() {
[ 603.248016] 2452.083740 | 0) | smp_call_function_many() {
[ 603.248016] 2452.083740 | 0) | smp_call_function_single() {
[ 603.248016] 2452.083740 | 0) | generic_exec_single() {
[ 603.248016] 2452.083740 | 0) | native_send_call_func_single_ipi() {
[ 603.248016] 2452.083740 | 0) | x2apic_send_IPI_mask() {
[ 603.248016] 2452.083741 | 0) 1.406 us | __x2apic_send_IPI_mask();
[ 603.248016] 2452.083743 | 0) 1.958 us | }
[ 603.248016] 2452.083743 | 0) 2.349 us | }
[ 603.248016] 2452.083743 | 0) 2.823 us | }
[ 603.248016] 2452.083743 | 0) + 13.132 us | csd_lock_wait.isra.4();
[ 603.248016] 2452.083746 | 1) + 25.238 us | }
[ 603.248016] 2452.083747 | 1) 0.987 us | vmx_read_l1_tsc();
[ 603.248016] 2452.083749 | 1) | vmx_handle_external_intr() {
[ 603.248016] 2452.083752 | 1) | smp_call_function_single_interrupt() {
[ 603.248016] 2452.083752 | 1) 0.063 us | kvm_guest_apic_eoi_write();
[ 603.248016] 2452.083753 | 1) | irq_enter() {
[ 603.248016] 2452.083753 | 1) 0.077 us | rcu_irq_enter();
[ 603.248016] 2452.083754 | 1) 0.751 us | }
[ 603.248016] 2452.083754 | 1) | generic_smp_call_function_single_interrupt() {
[ 603.248016] 2452.083754 | 1) | flush_smp_call_function_queue() {
[ 603.248016] 2452.083755 | 1) 0.459 us | flush_tlb_func();
[ 603.248016] 2452.083756 | 1) 0.089 us | csd_unlock();
[ 603.248016] 2452.083757 | 1) 2.180 us | }
[ 603.248016] 2452.083757 | 0) + 16.726 us | }
[ 603.248016] 2452.083757 | 1) 2.826 us | }
[ 603.248016] 2452.083757 | 0) + 17.236 us | }
[ 603.248016] 2452.083757 | 0) + 17.722 us | }

Later in the trace we see the same call again, but this time
vmx_handle_external_intr() ignores the interrupt:

[ 603.248016] 2452.083823 | 0) | ptep_clear_flush() {
[ 603.248016] 2452.083824 | 0) | flush_tlb_page() {
[ 603.248016] 2452.083824 | 0) 0.109 us | leave_mm();
[ 603.248016] 2452.083824 | 0) | native_flush_tlb_others() {
[ 603.248016] 2452.083824 | 0) | smp_call_function_many() {
[ 603.248016] 2452.083825 | 0) | smp_call_function_single() {
[ 603.248016] 2452.083825 | 0) | generic_exec_single() {
[ 603.248016] 2452.083825 | 0) | native_send_call_func_single_ipi() {
[ 603.248016] 2452.083825 | 0) | x2apic_send_IPI_mask() {
[ 603.248016] 2452.083826 | 0) 1.625 us | __x2apic_send_IPI_mask();
[ 603.248016] 2452.083828 | 0) 2.173 us | }
[ 603.248016] 2452.083828 | 0) 2.588 us | }
[ 603.248016] 2452.083828 | 0) 3.082 us | }
[ 603.248016] 2452.083828 | 0) | csd_lock_wait.isra.4() {
[ 603.248016] 2452.083848 | 1) + 44.033 us | }
[ 603.248016] 2452.083849 | 1) 0.975 us | vmx_read_l1_tsc();
[ 603.248016] 2452.083851 | 1) 1.031 us | vmx_handle_external_intr();
[ 603.248016] 2452.083852 | 1) 0.234 us | __srcu_read_lock();
[ 603.248016] 2452.083853 | 1) | vmx_handle_exit() {
[ 603.248016] 2452.083854 | 1) | handle_ept_violation() {
[ 603.248016] 2452.083856 | 1) | kvm_mmu_page_fault() {
[ 603.248016] 2452.083856 | 1) | tdp_page_fault() {
[ 603.248016] 2452.083856 | 1) 0.092 us | mmu_topup_memory_caches();
[ 603.248016] 2452.083857 | 1) | gfn_to_memslot_dirty_bitmap.isra.84() {
[ 603.248016] 2452.083857 | 1) 0.231 us | gfn_to_memslot();
[ 603.248016] 2452.083858 | 1) 0.774 us | }

So potentially, CPU0 generated an interrupt that caused vcpu_enter_guest to be
called on CPU1. However, when vmx_handle_external_intr was called, the
interrupt was never dispatched any further.

Another experiment here would be to dump vmcs_read32(VM_EXIT_INTR_INFO); to see
why we don't handle the interrupt.
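For reference, a rough sketch of what that instrumentation could look like in
arch/x86/kvm/vmx.c (the mask test below mirrors the check that
vmx_handle_external_intr() already performs; only the print is new, and the
exact placement is up for discussion):

static void vmx_handle_external_intr(struct kvm_vcpu *vcpu)
{
        u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);

        /*
         * The interrupt is only dispatched to the host handler when the
         * exit was caused by a valid external interrupt; log the raw value
         * when that test fails so we can see why the IPI is being ignored.
         */
        if ((exit_intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK))
                        != (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR))
                pr_warn_ratelimited("vmx: external intr not dispatched, intr_info=0x%08x\n",
                                    exit_intr_info);

        /* ... existing dispatch / local_irq_enable() path unchanged ... */
}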


In 201504282155/dmesg.* we see a similar pattern:

[ 1578.184041] 4152.856104 | 1) | flush_tlb_page() {
[ 1578.184041] 4152.856105 | 1) 0.122 us | leave_mm();
[ 1578.184041] 4152.856105 | 1) | native_flush_tlb_others() {
[ 1578.184041] 4152.856106 | 1) | smp_call_function_many() {
[ 1578.184041] 4152.856106 | 1) | smp_call_function_single() {
[ 1578.184041] 4152.856107 | 1) | generic_exec_single() {
[ 1578.184041] 4152.856107 | 1) | native_send_call_func_single_ipi() {
[ 1578.184041] 4152.856107 | 1) | x2apic_send_IPI_mask() {
[ 1578.184041] 4152.856108 | 1) 2.770 us | __x2apic_send_IPI_mask();
[ 1578.184041] 4152.856111 | 1) 3.725 us | }
[ 1578.184041] 4152.856112 | 1) 4.383 us | }
[ 1578.184041] 4152.856112 | 1) 5.220 us | }
[ 1578.184041] 4152.856112 | 1) | csd_lock_wait.isra.4() {
[ 1578.184041] 4152.856149 | 0) + 57.757 us | }
[ 1578.184041] 4152.856150 | 0) 1.613 us | vmx_read_l1_tsc();
[ 1578.184041] 4152.856153 | 0) 1.566 us | vmx_handle_external_intr();
[ 1578.184041] 4152.856156 | 0) 0.360 us | __srcu_read_lock();
[ 1578.184041] 4152.856157 | 0) | vmx_handle_exit() {
[ 1578.184041] 4152.856158 | 0) | handle_wrmsr() {
[ 1578.184041] 4152.856158 | 0) | kvm_set_msr() {

201504282251 and 201504290305 didn't seem to produce usable ftrace output.

201504290015 shows the same thing, but afterwards another interrupt does get
through and is handled; it is the original call that hangs the machine.

[ 4906.553939] 12500.903369 | 0) 1.736 us | __x2apic_send_IPI_mask();
[ 4906.556244] 12500.903371 | 0) 2.361 us | }
[ 4906.558222] 12500.903371 | 0) 2.785 us | }
[ 4906.560198] 12500.903371 | 0) 3.374 us | }
[ 4906.562131] 12500.903372 | 0) | csd_lock_wait.isra.4() {
[ 4906.564334] 12500.903391 | 1) + 50.198 us | }
[ 4906.566201] 12500.903392 | 1) 0.991 us | vmx_read_l1_tsc();
[ 4906.568284] 12500.903394 | 1) 0.977 us | vmx_handle_external_intr();
[ 4906.570436] 12500.903395 | 1) 0.260 us | __srcu_read_lock();
[ 4906.572727] 12500.903396 | 1) | vmx_handle_exit() {
[ 4906.574810] 12500.903397 | 1) | handle_wrmsr() {
[ 4906.576888] 12500.903397 | 1) | kvm_set_msr() {
[ 4906.578965] 12500.903397 | 1) | vmx_set_msr() {

<snip>

[ 4918.994035] 12500.913985 | 1) | native_send_call_func_single_ipi() {
[ 4918.996510] 12500.913986 | 1) | x2apic_send_IPI_mask() {
[ 4918.998809] 12500.913986 | 1) 2.050 us | __x2apic_send_IPI_mask();
[ 4919.001153] 12500.913989 | 1) 2.820 us | }
[ 4919.003186] 12500.913989 | 1) 3.444 us | }
[ 4919.005215] 12500.913989 | 1) 4.145 us | }
[ 4919.007180] 12500.913989 | 0) | smp_call_function_single_interrupt() {
[ 4919.009516] 12500.913989 | 1) 4.474 us | csd_lock_wait.isra.4();
[ 4919.011745] 12500.913990 | 0) 0.116 us | kvm_guest_apic_eoi_write();
[ 4919.013994] 12500.913990 | 0) | irq_enter() {
[ 4919.016092] 12500.913991 | 0) 0.115 us | rcu_irq_enter();
[ 4919.018210] 12500.913991 | 0) 0.607 us | }
[ 4919.020165] 12500.913991 | 0) | generic_smp_call_function_single_interrupt() {
[ 4919.022645] 12500.913992 | 0) | flush_smp_call_function_queue() {
[ 4919.025005] 12500.913992 | 0) | flush_tlb_func() {
[ 4919.027223] 12500.913993 | 0) 0.348 us | leave_mm();


201504290245 and 201504290342 aren't very clear: we send the interrupt, then
CPU1 gets a timer interrupt right away. CPU0 shows lost events, so perhaps we
missed what actually happened in the ftrace output.

[ 9504.605612] 22133.181730 | 1) 3.307 us | __x2apic_send_IPI_mask();
[ 9504.605616] 22133.181734 | 1) 4.081 us | }
[ 9504.605619] 22133.181735 | 1) 4.735 us | }
[ 9504.605622] 22133.181735 | 1) 5.744 us | }
[ 9504.605626] 22133.181735 | 1) | csd_lock_wait.isra.4() {
[ 9504.605632] 22133.185489 | 1) ==========> |.22133.185489 | 1) | smp_apic_timer_interrupt() {
[ 9504.605637] 22133.185489 | 1) 0.088 us | kvm_guest_apic_eoi_write();
[ 9504.605641] 22133.185490 | 1) | irq_enter() {
[ 9504.605646] 22133.185490 | 1) 0.130 us | rcu_irq_enter();
<snip>
[ 9504.606348] CPU:0 [LOST 902542527 EVENTS].22133.188647 | 0) 0.096 us | } /* inc_zone_page_state */
[ 9504.606354] 22133.188647 | 0) 0.088 us | mem_cgroup_end_page_stat();


201504290305 shows the call, but it doesn't look like CPU1 ever responds
to it.

[ 643.522880] 22837.075483 | 0) | flush_tlb_page() {
[ 643.522885] 22837.075484 | 1) 0.160 us | _raw_spin_unlock_irqrestore();
[ 643.522890] 22837.075485 | 0) | native_flush_tlb_others() {
[ 643.522895] 22837.075485 | 1) 0.083 us | _raw_spin_unlock_irqrestore();
[ 643.522899] 22837.075486 | 0) | smp_call_function_many() {
[ 643.522904] 22837.075486 | 1) 0.145 us | mem_cgroup_update_page_stat();
[ 643.522908] 22837.075487 | 0) | smp_call_function_single() {
[ 643.522913] 22837.075487 | 1) 0.255 us | dec_zone_page_state();
[ 643.522918] 22837.075487 | 0) 0.288 us | generic_exec_single();
[ 643.522923] 22837.075488 | 1) 0.336 us | inc_zone_page_state();
[ 643.522927] 22837.075489 | 0) | csd_lock_wait.isra.4() {
[ 643.522931] 22837.075490 | 1) 0.115 us | mem_cgroup_end_page_stat();
[ 643.522935] 22837.075491 | 1) + 14.409 us | }
[ 643.522940] 22837.075491 | 1) 0.113 us | __wake_up_bit();
[ 643.522943] 22837.075492 | 1) + 16.781 us | }
[ 643.522948] 22837.075493 | 1) | end_page_writeback() {
[ 643.522952] 22837.075494 | 1) | test_clear_page_writeback() {
[ 643.522957] 22837.075494 | 1) 0.110 us | page_mapping();
[ 643.522962] 22837.075495 | 1) 0.135 us | mem_cgroup_begin_page_stat();
[ 643.522967] 22837.075496 | 1) | inode_to_bdi() {
[ 643.522972] 22837.075497 | 1) 0.115 us | sb_is_blkdev_sb();


Thanks,
--chris j arges
