Re: [patch 11/29] lockup_detector: Remove park_in_progress hackery

From: Thomas Gleixner
Date: Tue Sep 05 2017 - 09:58:39 EST


On Mon, 4 Sep 2017, Ulrich Obergfell wrote:
> "b94f51183b06 ("kernel/watchdog: prevent false hardlockup on overloaded
> system") tries to fix the following issue:
>
> watchdog_stop()
> hrtimer_cancel()
> perf_nmi_event_stop()
>
> If the task gets preempted after canceling the hrtimer or the VM gets
> scheduled out long enough, then there is a chance that the next NMI will
> see a stale hrtimer interrupt count and trigger a false positive hard
> lockup splat."
>
> This is not exactly the issue that b94f51183b06 is actually supposed to fix.
> For details, please see the accompanying commit log message at:
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/watchdog.c?id=b94f51183b0617e7b9b4fb4137d4cf1cab7547c2
>
> For convenience, here is a slightly different description of the scenario
> that b94f51183b06 originally aims to address:
>
> - A thread hogs CPU N (soft lockup) so that watchdog/N is unable to run.
> - A user re-configures 'watchdog_thresh' on the fly. The reconfiguration
> requires parking/unparking of all watchdog threads.
> - watchdog_timer_fn() on all CPUs can already pick up the new sample period,
> but the perf counter frequency cannot be updated on all CPUs yet because
> watchdog/N is unable to park, so watchdog_overflow_callback() still occurs
> at a frequency that is based on the old sample period (on CPUs >= N which
> have not had a chance to park their watchdog threads yet).
> - If 'watchdog_thresh' was increased by the reconfiguration, the interval
> at which watchdog_timer_fn() is now running could be too large for the
> watchdog_overflow_callback() function to observe a change of the timer
> interrupt counter. This can trigger a false positive.

Oh well. You did not fix that at all with that hack.

The real problem is that the watchdog threshold write in proc immediately
updates sample_period, which is evaluated by the running thread.

proc_write()
set_sample_period()
<----- Broken starts
proc_watchdog_update()
watchdog_enable_all_cpus()
update_watchdog_all_cpus()
watchdog_park_threads()
<----- Broken ends
watchdog_park_in_progress = 1

So up to the point where watchdog_park_in_progress is set to 1, the change
to sample_period is visible to all active watchdog threads and will be
picked up by the watchdog threads to rearm their timer. The NMI watchdog
will run with the old frequency until the thread is parked and the watchdog
NMI disarmed.

The underlying problem is the non synchronized update of sample_period and
this band aid hack is not solving that at all.

> The above scenario should be rare in practice, so it seemed reasonable to
> come up with a simple low-risk fix that addresses merely this particular
> scenario (watchdog_timer_fn() and watchdog_overflow_callback() now return
> immediately while park is in progress / no lockup detection is performed
> during that window of time).

That thing is neither reasonable nor a fix. It's mindless hackery which
papers over the problem instead of fixing the root cause. It neither saw
the problem in the park/unpark in general which is again a valid source for
false positives.

> In relation to:
>
> "That commit added a complete abstrusity with a atomic variable telling the
> NMI watchdog function that stop is in progress. This is so stupid, that
> it's not funny anymore."
>
> I cannot see any 'abstrusity' or 'stupidity' in the pragmatic approach that
> was taken in b94f51183b06 to fix an issue that should be rare in practice
> as described above.
>
> Please change the commit log of [patch 11/29] since it is inconsistent with
> the commit log of b94f51183b06.

Yes, I will change it to include the above reasoning why this is complete
crap and keep the extra findings for reference.

Thanks,

tglx