Re: Crashes in perf_event_ctx_lock_nested

From: Peter Zijlstra
Date: Tue Oct 31 2017 - 09:49:03 EST


On Mon, Oct 30, 2017 at 03:45:12PM -0700, Guenter Roeck wrote:
> I added some logging and a long msleep() in hardlockup_detector_perf_cleanup().
> Here is the result:
>
> [ 0.274361] NMI watchdog: ############ hardlockup_detector_perf_init
> [ 0.274915] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [ 0.277049] NMI watchdog: ############ hardlockup_detector_perf_cleanup
> [ 0.277593] NMI watchdog: ############ hardlockup_detector_perf_enable(0)
> [ 0.278027] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [ 1.312044] NMI watchdog: ############ hardlockup_detector_perf_cleanup done
> [ 1.385122] NMI watchdog: ############ hardlockup_detector_perf_enable(1)
> [ 1.386028] NMI watchdog: ############ hardlockup_detector_event_create(1)
> [ 1.466102] NMI watchdog: ############ hardlockup_detector_perf_enable(2)
> [ 1.475536] NMI watchdog: ############ hardlockup_detector_event_create(2)
> [ 1.535099] NMI watchdog: ############ hardlockup_detector_perf_enable(3)
> [ 1.535101] NMI watchdog: ############ hardlockup_detector_event_create(3)

> [ 7.222816] NMI watchdog: ############ hardlockup_detector_perf_disable(0)
> [ 7.230567] NMI watchdog: ############ hardlockup_detector_perf_disable(1)
> [ 7.243138] NMI watchdog: ############ hardlockup_detector_perf_disable(2)
> [ 7.250966] NMI watchdog: ############ hardlockup_detector_perf_disable(3)
> [ 7.258826] NMI watchdog: ############ hardlockup_detector_perf_enable(1)
> [ 7.258827] NMI watchdog: ############ hardlockup_detector_perf_cleanup
> [ 7.258831] NMI watchdog: ############ hardlockup_detector_perf_enable(2)
> [ 7.258833] NMI watchdog: ############ hardlockup_detector_perf_enable(0)
> [ 7.258834] NMI watchdog: ############ hardlockup_detector_event_create(2)
> [ 7.258835] NMI watchdog: ############ hardlockup_detector_event_create(0)
> [ 7.260169] NMI watchdog: ############ hardlockup_detector_perf_enable(3)
> [ 7.260170] NMI watchdog: ############ hardlockup_detector_event_create(3)
> [ 7.494251] NMI watchdog: ############ hardlockup_detector_event_create(1)
> [ 8.287135] NMI watchdog: ############ hardlockup_detector_perf_cleanup done
>
> Looks like there are a number of problems: hardlockup_detector_event_create()
> creates the event data structure even if it is already created,

Right, that does look dodgy. And on its own should be fairly straight
forward to cure. But I'd like to understand the rest of it first.

> and hardlockup_detector_perf_cleanup() runs unprotected and in
> parallel to the enable/create functions.

Well, looking at the code, cpu_maps_update_begin() aka.
cpu_add_remove_lock is serializing cpu_up() and cpu_down() and _should_
thereby also serialize cleanup vs the smp_hotplug_thread operations.

Your trace does indeed indicate this is not the case, but I cannot, from
the code, see how this could happen.

Could you use trace_printk() instead and boot with
"trace_options=stacktrace" ?

> ALso, the following message is seen twice.
>
> [ 0.278758] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
> [ 7.258838] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
>
> I don't offer a proposed patch since I have no idea how to best solve the
> problem.
>
> Also, is the repeated enable/disable/cleanup as part of the normal boot
> really necessary ?

That's weird, I don't see that on my machines. We very much only bring
up the CPUs _once_. Also note they're 7s apart. Did you do something
funny like resume-from-disk or so?