Re: [RFC 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful

From: Thomas Gleixner
Date: Mon Jun 17 2019 - 17:14:11 EST


On Wed, 5 Jun 2019, Petr Mladek wrote:

> Hi,
>
> we were analyzing logs with several softlockup reports in flush_tlb_kernel_range().
> They were confusing. In particular, it was not clear whether it was a deadlock,
> a livelock, or separate softlockups.
>
> It turned out that even a simple busy loop:
>
> 	while (true)
> 		cpu_relax();
>
> is able to produce several softlockup reports:
>
> [ 168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
> [ 196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
> [ 236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]
>
>
> I tried to understand the tricky watchdog code and produced two patches
> that would be helpful to debug the original real bug:
>
> 1st patch prevents restart of the watchdog from unrelated locations.
>
> 2nd patch helps to distinguish several possible situations by
> regular reports.
>
> 3rd patch can be used for testing the problem.
>
>
> The watchdog code might deserve even more cleanup. Anyway, I would
> like to hear others' opinions first.

Anything which improves debuggability is welcome. Unfortunately, you did not
include an example of the output after these patches are applied.

Thanks,

tglx