Re: High rate of touch_softlockup makes Soft Lockup detector useless

From: Thomas Gleixner
Date: Thu Jul 07 2016 - 11:19:37 EST


On Wed, 6 Jul 2016, Joel Fernandes wrote:
> In a system running a recent kernel, I am trying to use soft lockup
> detector to detect soft lockups in the system.
> During this exercise, I see that even with real soft lockups, the
> kernel is unable to detect them.

What is your definition of a real soft lockup?

> Digging in further, I found that the softlockup watchdog is touched
> 1000s of times per second by the NOHZ code.
> Prints revealed the following 2 functions calling touch_softlockup_watchdog:
> [ 165.960292] CPU0 touch: tick_nohz_restart_sched_tick
> [ 165.960309] CPU1 touch: tick_nohz_update_jiffies
>
> I am wondering, do we really need to touch the softlockup watchdog
> from the tick_nohz code?
> From the code comments it looks like the watchdog is touched because
> the tick was off and was being turned on, so the watchdog may
> not have been touched for a long time.
> BUT, wouldn't the hrtimer interrupt for the watchdog timer cause the
> watchdog thread to be scheduled even though the tick was off for a
> long time? Then in that case do we really need to touch the softlockup
> watchdog from the tick_nohz code?

Yes, it will be scheduled, but it might be too late. Assume the following:

t1        hrtimer fires
          watchdog thread runs
          watchdog timer is rearmed to t2 = t1 + period

          idle sleep

t2 - 1ms  long running thread gets scheduled

t2        hrtimer fires

          long running thread stops

          watchdog thread runs and detects soft lockup
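
To put numbers on the sequence above, here is a rough sketch in Python. It is a toy model of this timeline only, not the kernel's implementation: the 4000 ms period, the 5 ms overrun past t2 and the helper name watchdog_gap are assumptions made up for illustration.

```python
PERIOD_MS = 4000  # assumed watchdog period, illustrative value only


def watchdog_gap(t1_ms, hog_start_ms, hog_stop_ms):
    """Return (gap between two watchdog thread runs, time the hog held the CPU)."""
    t2_ms = t1_ms + PERIOD_MS  # hrtimer was rearmed when the thread ran at t1
    # The watchdog thread is woken at t2 but cannot run until the hog stops.
    next_run_ms = max(t2_ms, hog_stop_ms)
    return next_run_ms - t1_ms, hog_stop_ms - hog_start_ms


t1 = 0
gap, hogged = watchdog_gap(
    t1,
    hog_start_ms=t1 + PERIOD_MS - 1,  # "t2 - 1ms": long running thread scheduled
    hog_stop_ms=t1 + PERIOD_MS + 5,   # assumed: it stops 5 ms after t2
)
print(f"gap between watchdog runs: {gap} ms, CPU actually hogged: {hogged} ms")
print("flagged as soft lockup:", gap > PERIOD_MS)
```

In this model the thread held the CPU for only a few milliseconds, yet the gap between two watchdog runs exceeds the period, because most of that gap was idle time.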

The soft lockup detector checks whether the CPU is hogged by some random
task. It does so by monitoring whether the watchdog task, which is periodically
woken by an hrtimer, gets to run before the watchdog period elapses.

If the cpu goes idle then nothing hogs the cpu and the check period can be
canceled.
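
A minimal sketch of what canceling the check period buys, again as a toy model in Python and not the kernel code. The class ToyWatchdog, its methods and all the numbers are hypothetical; touch() merely stands in for the effect of touch_softlockup_watchdog being called when the CPU leaves idle.

```python
class ToyWatchdog:
    """Toy model: remembers when the check period last (re)started."""

    def __init__(self, period_ms, now_ms=0):
        self.period_ms = period_ms
        self.last_touch_ms = now_ms

    def touch(self, now_ms):
        # Restart the check period from 'now'.
        self.last_touch_ms = now_ms

    def is_locked_up(self, now_ms):
        # Lockup is reported if the period elapsed without a restart.
        return now_ms - self.last_touch_ms > self.period_ms


def run(touch_on_idle_exit):
    wd = ToyWatchdog(period_ms=4000)       # watchdog thread ran at t1 = 0
    idle_exit_ms, hog_stop_ms = 3999, 4005  # assumed times, in ms
    if touch_on_idle_exit:
        wd.touch(idle_exit_ms)              # CPU leaves idle: period restarts
    return wd.is_locked_up(hog_stop_ms)     # check once the hog has stopped


print("without touch on idle exit:", run(False))
print("with touch on idle exit:   ", run(True))
```

Without the touch, the idle time is counted against whatever task runs next; with it, only the time since the CPU left idle is measured.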

Thanks,

tglx