Re: WARN_ON_ONCE() hit at kernel/events/core.c:330

From: Thomas-Mich Richter
Date: Thu Apr 04 2019 - 05:26:07 EST


On 4/3/19 12:41 PM, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:47:00AM +0200, Thomas-Mich Richter wrote:
>> I use linux 5.1.0-rc3 on s390 and got this WARN_ON_ONCE message:
>>
>> WARNING: CPU: 15 PID: 0 at kernel/events/core.c:330
>> event_function_local.constprop.79+0xe2/0xe8
>>
>> which was introduced with
>> commit cca2094605ef ("perf/core: Fix event_function_local()").
>> ..snip....
>>
>> Any ideas or hints who to avoid/fix this warning?
>
> Some thoughts here:
>
> https://lkml.kernel.org/r/20190213101644.GN32534@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> tl;dr, I've no frigging clue.
>

I have read this thread and at the end you mentioned:

Humm, but in that case:

context_switch()
prepare_task_switch()
perf_event_task_sched_out()
__perf_event_task_sched_out()
perf_event_context_sched_out()
task_ctx_sched_out()
ctx_sched_out()
group_sched_out()
event_sched_out()
if (event->pending_disable)

Would have already cleared the pending_disable state, so the IPI would
not have ran perf_event_disable_local() in the first place.

Our test system is configured to panic in WARN_ON_ONCE(). I looked
at the dump. The event triggering WARN_ON_ONCE:

crash> struct perf_event.oncpu 0x1f9b24800
oncpu = 0
crash> struct perf_event.state 0x1f9b24800
state = PERF_EVENT_STATE_ACTIVE
crash>

This means the code in
static void event_sched_out(....)
{
....
event->pmu->del(event, 0);
event->oncpu = -1;

if (event->pending_disable) {
event->pending_disable = 0;
state = PERF_EVENT_STATE_OFF;
}
perf_event_set_state(event, state);
...
}

has not finished and returned from this function. So the task was not completely context-switched
out from CPU 0 while the interrupt handler was executing on CPU 15:

static void perf_pending_event(...)
{
....
if (event->pending_disable) {
event->pending_disable = 0;
perf_event_disable_local(event); <--- Causes the WARN_ON_ONCE()
}
.....
}

I think there is a race, especially when the interrupt handler on CPU 15
was invoked via timer interrupt an runs on a different CPU.

>
> Does it reproduce on x86 without virt on? I don't have a PPC LPAR to
> test things on.
>

s390 LPARs run under hipervisor control, no chance to run any OS without it.

--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
GeschÃftsfÃhrung: Dirk Wittkopp
Sitz der Gesellschaft: BÃblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294