[RH72 Spectre] ibpb_enabled = 1 leads to hard LOCKUP under x86_64 host machine

From: Hou Tao
Date: Sat Jan 20 2018 - 04:04:03 EST


Hi all,

We are testing the Spectre and Meltdown patches on an OS derived from RH7.2,
and have hit a hard LOCKUP panic on an x86_64 host.

The hard LOCKUP is reproducible. It goes away if we disable ibpb by writing 0
to the ibpb_enabled file, and it appears again when we enable ibpb (by writing
1 or 2).

The workload running on the host just starts two hundred security containers
sequentially, then stops them, and repeats. The security container is
implemented using docker and kvm, so there are many "docker-containerd-shim"
and "qemu-system-x86_64uvm" processes. Reproduction of the hard LOCKUP can be
accelerated by running the following command ("hackbench" comes from the ltp
project):
while true; do ./hackbench 100 process 1000; done

We have saved vmcore files for the hard LOCKUPs using kdump. The hard LOCKUPs
are triggered by different processes and on different kernel stacks. We have
analyzed one hard LOCKUP: it is caused by wake_up_new_task() spinning while
trying to take rq->lock through __task_rq_lock(). The value of the lock is
422320416 (head = 6432, tail = 6444). We have found the five processes that
are waiting on the lock, but we cannot find the process that took it.
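
For readers who want to check the head/tail numbers, here is a minimal sketch
of how such a lock word could be decoded. It assumes a ticket spinlock with
16-bit tickets, head in the low half and tail in the high half, and a ticket
increment of 2. None of these layout details are stated in this mail, so treat
them as assumptions; the function names are made up for illustration.

```python
# Sketch: decode a ticket spinlock word into its head and tail tickets.
# Assumed layout (not confirmed by the mail): 16-bit head in the low half,
# 16-bit tail in the high half, tickets advancing in steps of 2.
TICKET_BITS = 16
TICKET_INC = 2  # assumption; some kernel configurations would use 1


def decode_ticket_lock(value, bits=TICKET_BITS):
    """Split a raw lock word into (head, tail)."""
    mask = (1 << bits) - 1
    head = value & mask
    tail = (value >> bits) & mask
    return head, tail


def outstanding_tickets(head, tail, inc=TICKET_INC, bits=TICKET_BITS):
    """Number of tickets handed out but not yet released (holder + waiters)."""
    mask = (1 << bits) - 1
    return ((tail - head) & mask) // inc


head, tail = decode_ticket_lock(422320416)
print(head, tail, outstanding_tickets(head, tail))
```

Under those assumptions the value decodes to head = 6432 and tail = 6444, and
the gap of 12 would correspond to six outstanding tickets, which is consistent
with one missing holder plus the five waiters described above.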

We guess something may be wrong with the CPU scheduler, because the RSP
register of the runv process, which is waiting for rq->lock, is incorrect. The
RSP points into the stack of swapper/57, and runv is also running on CPU 57
(more details at the end of this mail). The same phenomenon exists in the
other hard LOCKUPs.

Has anyone encountered a similar problem before? Any suggestions or
directions for debugging the hard LOCKUP would be appreciated.

Thanks,
Tao

---
The following lines are output from one instance of the hard LOCKUP panics:

* output from crash, which complains about the unexpected RSP register:

crash: inconsistent active task indications for CPU 57:
runqueue: ffff882eac72e780 "runv" (default)
current_task: ffff882f768e1700 "swapper/57"

crash> runq -m -c 57
CPU 57: [0 00:00:00.000] PID: 8173 TASK: ffff882eac72e780 COMMAND: "runv"
crash> bt 8173
PID: 8173 TASK: ffff882eac72e780 CPU: 57 COMMAND: "runv"
#0 [ffff885fbe145e00] stop_this_cpu at ffffffff8101f66d
#1 [ffff885fbe145e10] kbox_rlock_stop_other_cpus_call at ffffffffa031e649
#2 [ffff885fbe145e50] smp_nmi_call_function_handler at ffffffff81047dd6
#3 [ffff885fbe145e68] nmi_handle at ffffffff8164fc09
#4 [ffff885fbe145eb0] do_nmi at ffffffff8164fd84
#5 [ffff885fbe145ef0] end_repeat_nmi at ffffffff8164eff9
[exception RIP: _raw_spin_lock+48]
RIP: ffffffff8164dc50 RSP: ffff882f768f3b18 RFLAGS: 00000002
RAX: 0000000000000a58 RBX: ffff882f76f1d080 RCX: 0000000000001920
RDX: 0000000000001922 RSI: 0000000000001922 RDI: ffff885fbe159580
RBP: ffff882f768f3b18 R8: 0000000000000012 R9: 0000000000000001
R10: 0000000000000400 R11: 0000000000000000 R12: ffff882f76f1d884
R13: 0000000000000046 R14: ffff885fbe159580 R15: 0000000000000039
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff882f768f3b18] _raw_spin_lock at ffffffff8164dc50
bt: cannot transition from exception stack to current process stack:
exception stack pointer: ffff885fbe145e00
process stack pointer: ffff882f768f3b18
current stack base: ffff882f34e38000

* kernel panic message when the hard LOCKUP occurs:

[ 4396.807556] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 55
[ 4396.807561] CPU: 55 PID: 8267 Comm: docker Tainted: G O ---- ------- 3.10.0-327.59.59.46.x86_64 #1
[ 4396.807563] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
[ 4396.807564] Call Trace:
[ 4396.807571] <NMI> [<ffffffff81646140>] dump_stack+0x19/0x1b
[ 4396.807575] [<ffffffff8163f792>] panic+0xd8/0x214
[ 4396.807582] [<ffffffff811228b1>] watchdog_overflow_callback+0xd1/0xe0
[ 4396.807589] [<ffffffff81166161>] __perf_event_overflow+0xa1/0x250
[ 4396.807595] [<ffffffff81166c34>] perf_event_overflow+0x14/0x20
[ 4396.807600] [<ffffffff810339b8>] intel_pmu_handle_irq+0x1e8/0x470
[ 4396.807610] [<ffffffff812ffc11>] ? ioremap_page_range+0x241/0x320
[ 4396.807617] [<ffffffff813a1044>] ? ghes_copy_tofrom_phys+0x124/0x210
[ 4396.807621] [<ffffffff813a11d0>] ? ghes_read_estatus+0xa0/0x190
[ 4396.807626] [<ffffffff8165058b>] perf_event_nmi_handler+0x2b/0x50
[ 4396.807629] [<ffffffff8164fc09>] nmi_handle.isra.0+0x69/0xb0
[ 4396.807633] [<ffffffff8164fd84>] do_nmi+0x134/0x410
[ 4396.807637] [<ffffffff8164eff9>] end_repeat_nmi+0x1e/0x7e
[ 4396.807643] [<ffffffff8164dc5a>] ? _raw_spin_lock+0x3a/0x50
[ 4396.807648] [<ffffffff8164dc5a>] ? _raw_spin_lock+0x3a/0x50
[ 4396.807653] [<ffffffff8164dc5a>] ? _raw_spin_lock+0x3a/0x50
[ 4396.807658] <<EOE>> [<ffffffff810bd33c>] wake_up_new_task+0x9c/0x170
[ 4396.807662] [<ffffffff8107dfbb>] do_fork+0x13b/0x320
[ 4396.807667] [<ffffffff8107e226>] SyS_clone+0x16/0x20
[ 4396.807672] [<ffffffff816577f4>] stub_clone+0x44/0x70
[ 4396.807676] [<ffffffff8165743d>] ? system_call_fastpath+0x16/0x1b

* cpu info for the first CPU (72 CPUs in total)

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
stepping : 2
microcode : 0x3b
cpu MHz : 2300.000
cache size : 46080 KB
physical id : 0
siblings : 36
core id : 0
cpu cores : 18
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm arat epb invpcid_single pln pts dtherm spec_ctrl ibpb_support tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc
bogomips : 4589.42
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management: