Re: [PATCH 2/3] watchdog/softlockup: Report the overall time of softlockups

From: Petr Mladek
Date: Fri Jan 17 2020 - 05:00:42 EST


On Thu 2020-01-16 14:39:31, Thomas Gleixner wrote:
> Petr,
>
> Petr Mladek <pmladek@xxxxxxxx> writes:
> > - if (touch_ts == 0) {
> > + /* Was the watchdog touched externally by a known slow code? */
> > + if (period_ts == 0) {
> > if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
> > /*
> > * If the time stamp was touched atomically
> > @@ -394,7 +405,12 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> >
> > /* Clear the guest paused flag on watchdog reset */
> > kvm_check_and_clear_guest_paused();
> > - __touch_watchdog();
> > + /*
> > + * The above kvm*() function could touch the watchdog.
> > + * Set the real timestamp later to avoid an infinite
> > loop.
>
> This comment makes no sense whatsoever. If period_ts is 0,
> i.e. something invoked touch_softlockup_watchdog*() then it does not
> make any difference whether the kvm function invokes one of those
> functions once more. The result is the same. The per cpu period_ts is
> still 0.
>
> The point is that _AFTER_ a intentional watchdog reset, the reporting
> base time needs to be set to now() in order to make it functional again.

Exactly. I think that my comment is just confusing. I wanted to say
that the order was important.

It was not obvious to me that kvm_check_and_clear_guest_paused() cleared
perior_ts and must be called before update_report_period_ts(). I spent
some time with debugging why the reshufled code stopped working ;-)

What about the following?

/*
* Clear the guest paused flag on watchdog. Might clear
* report_period_ts.
*/
kvm_check_and_clear_guest_paused();

update_report_period_ts();


> > + */
> > + reset_report_period_ts();
>
> Btw, the function name is misleading. I got confused several times
> because I expected the reset to set the timestamp to 0, which is not the
> case. update_report_period_ts() is far more intuitive.

Sounds good.


> > @@ -404,8 +420,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> > * indicate it is getting cpu time. If it hasn't then
> > * this is a good indication some task is hogging the cpu
> > */
> > - duration = is_softlockup(touch_ts);
> > + duration = is_softlockup(touch_ts, period_ts);
> > if (unlikely(duration)) {
>
> This lacks a comment. Your changelog paragraph:
>
> > Also the timestamp should get reset explicitly. It is done also by the code
> > printing the backtrace. But it is just a side effect and far from
> > obvious.
>
> is probably referring to this, but it confused me more than it helped
> simply because the update of the timestamp happens unconditionally even
> when the backtrace code is not reached due to the KVM check

Where is the timestamp updated unconditionaly, please?

I found that it happens, for example, in printk_stack_address() that
is hidden "deep" in show_stack().

> So this is a change vs. the current implementation which is not
> documented and explained.

To me, it looks obvious to reset/update the period when the current
period ended and we are about to report softlockup.

OK, it might create wrong assumtion that the updated timestamp will
really get used. It is not true because it will get reset inside
the above mentioned show_stack(). But is it really guaranteed
that the watchdog will be touched there?

IMHO, the explicit call makes the code more reliable and easier
to understand. The hidden touch() might get re(moved) at any time.


> > + reset_report_period_ts();
> > /*
> > * If a virtual machine is stopped by the host it can look to
> > * the watchdog like a soft lockup, check to see if the host

Thanks a lot for review.

Best Regards,
Petr

PS: I will have only limited internet connection the following week.