Re: WARNING in event_function_local

From: Kees Cook
Date: Tue Feb 12 2019 - 20:28:16 EST


On Wed, May 9, 2018 at 10:18 PM Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
> On Sat, Jan 27, 2018 at 8:58 PM, syzbot
> <syzbot+352bd10e338d9a90e5e0@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > Hello,
> >
> > syzbot hit the following crash on bpf-next commit
> > 8223967fe0b8eb2448cca5cfe3c64a0838e6f60d (Sat Jan 27 01:06:23 2018 +0000)
> > Merge branch 'fix-lpm-map'
> >
> > So far this crash happened 2 times on bpf-next.
> > C reproducer is attached.
> > syzkaller reproducer is attached.
> > Raw console output is attached.
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached.
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+352bd10e338d9a90e5e0@xxxxxxxxxxxxxxxxxxxxxxxxx
> > It will help syzbot understand when the bug is fixed. See footer for
> > details.
> > If you forward the report, please keep this part and the footer.
>
>
> This was bisected to:
>
> commit 4ee7c60de83ac01fa4c33c55937357601631e8ad
> Author: Steven Rostedt (VMware) <rostedt@xxxxxxxxxxx>
> Date: Fri Mar 23 10:18:03 2018 -0400
>
> init, tracing: Add initcall trace events
>
>
>
> > WARNING: CPU: 1 PID: 5105 at kernel/events/core.c:331
> > event_function_local.constprop.102+0x494/0x560 kernel/events/core.c:331
> > Kernel panic - not syncing: panic_on_warn set ...
> >
> > CPU: 1 PID: 5105 Comm: syzkaller318382 Not tainted 4.15.0-rc8+ #8
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> > <IRQ>
> > __dump_stack lib/dump_stack.c:17 [inline]
> > dump_stack+0x194/0x257 lib/dump_stack.c:53
> > panic+0x1e4/0x41c kernel/panic.c:183
> > __warn+0x1dc/0x200 kernel/panic.c:547
> > report_bug+0x211/0x2d0 lib/bug.c:184
> > fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
> > fixup_bug arch/x86/kernel/traps.c:247 [inline]
> > do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
> > do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> > invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1096
> > RIP: 0010:event_function_local.constprop.102+0x494/0x560
> > kernel/events/core.c:331
> > RSP: 0018:ffff8801db307d00 EFLAGS: 00010006
> > RAX: ffff8801d902a180 RBX: ffff8801d60beac0 RCX: ffffffff81862a54
> > RDX: 0000000000010000 RSI: ffff8801db3079f0 RDI: ffff8801d60bebc8
> > RBP: ffff8801db307da0 R08: 1ffff1003b660edf R09: 000000000000000c
> > R10: ffff8801db307bb8 R11: 0000000000000044 R12: ffff8801db330e00
> > R13: 1ffff1003b660fa3 R14: ffff8801d90841c0 R15: ffff8801d979ce40
> > perf_event_disable_local kernel/events/core.c:1979 [inline]
> > perf_pending_event+0x1e5/0x240 kernel/events/core.c:5446
> > irq_work_run_list+0x184/0x240 kernel/irq_work.c:155
> > irq_work_run+0x1d/0x50 kernel/irq_work.c:170
> > smp_irq_work_interrupt+0x103/0x640 arch/x86/kernel/irq_work.c:21
> > irq_work_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:968
> > </IRQ>

I got curious why this report went silent, and bisected the "fix" to:

214ff83d4473 ("KVM: x86: hyperv: implement PV IPI send hypercalls")

This contains new tracepoints, and reverting them re-exposes the bug:

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 89d20ed1d2e8..462fab402bfa 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1468,15 +1468,21 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu
*current_vcpu, u64 ingpa, u64 outgpa,
all_cpus = false;
valid_bank_mask = BIT_ULL(0);

+ panic("how am I using hyperV?!");
+ /*
trace_kvm_hv_send_ipi(vector, sparse_banks[0]);
+ */
} else {
if (unlikely(kvm_read_guest(kvm, ingpa, &send_ipi_ex,
sizeof(send_ipi_ex))))
return HV_STATUS_INVALID_HYPERCALL_INPUT;

+ panic("how am I using hyperV?! (ex)");
+ /*
trace_kvm_hv_send_ipi_ex(send_ipi_ex.vector,
send_ipi_ex.vp_set.format,
send_ipi_ex.vp_set.valid_bank_mask);
+ */

vector = send_ipi_ex.vector;
valid_bank_mask = send_ipi_ex.vp_set.valid_bank_mask;
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 6432d08c7de7..895bd184feea 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1424,6 +1424,7 @@ TRACE_EVENT(kvm_hv_flush_tlb_ex,
/*
* Tracepoints for kvm_hv_send_ipi.
*/
+#if 0
TRACE_EVENT(kvm_hv_send_ipi,
TP_PROTO(u32 vector, u64 processor_mask),
TP_ARGS(vector, processor_mask),
@@ -1462,6 +1463,7 @@ TRACE_EVENT(kvm_hv_send_ipi_ex,
__entry->vector, __entry->format,
__entry->valid_bank_mask)
);
+#endif
#endif /* _TRACE_KVM_H */

#undef TRACE_INCLUDE_PATH

Steve, what could possibly be happening here? Just adding more
tracepoints causes some kind of race where the task vs current test
trips in event_function_local():

if (WARN_ON_ONCE(task != current))
goto unlock;

Is this maybe just an unlucky condition with the event loop running in
an IRQ? Should the WARN be expected, or is running under an IRQ
unexpected?

It'd be nice to find the source of this, since it seems to have gotten
papered over instead of actually fixed. (Especially since the original
"introduction" bisect shows similar "just adding new trace events.)

--
Kees