RE: [PATCH v2] NMI: fix NMI period is not correct when cpufrequency changes issue.

From: Pan, Zhenjie
Date: Mon Apr 22 2013 - 20:53:27 EST




> -----Original Message-----
> From: Don Zickus [mailto:dzickus@xxxxxxxxxx]
> Sent: Tuesday, April 23, 2013 2:59 AM
> To: Pan, Zhenjie
> Cc: Stephane Eranian; Peter Zijlstra; paulus@xxxxxxxxx; mingo@xxxxxxxxxx;
> acme@xxxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; tglx@xxxxxxxxxxxxx;
> Liu, Chuansheng; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v2] NMI: fix NMI period is not correct when cpu
> frequency changes issue.
>
> On Mon, Apr 22, 2013 at 12:50:34AM +0000, Pan, Zhenjie wrote:
> > > I believe it mattered to the Chrome folks. They want the watchdog to
> > > be as tight as possible so the user experience isn't a hang but a
> > > quick reboot instead. They like setting the watchdog to something
> > > like 2 seconds.
> > >
> > > There was a patch a few months ago that tried to hack around this
> > > issue and I suggested this approach as a better solution. I forgot
> > > what the original problem was. Perhaps someone can jump in and
> > > explain the problem being solved (other than the watchdog isn't
> > > always 10 seconds)?
> > >
> > > Cheers,
> > > Don
> >
> > Yes, I also think the period is important sometimes.
> > As I mentioned before, the case I meet is:
> > When the system hang with interrupt disabled, we use NMI to detect.
> > Then it will find hard lockup and cause a panic.
> > Panic is very important for debug these kind of issues.
> >
> > But if cpu frequency change, the period will be 2 times, 3 times even
> > more.(if cpu can down from 2.0GHz to 200MHz, will be 10 times, it's a
> > very big deviation) This make watchdog reset happen before hard lockup
> > detect.
>
> So you are saying with the longer hard lockup delay, the iTCO_wdt is firing
> before the hard lockup detector?
>
> Cheers,
> Don

Here is a detailed example:

0s                                    50s         60s        70s
|_____________________________________|___________|__________|

At 50s, a watchdog interrupt fires to tell the watchdog daemon to update the watchdog.
If the daemon does not update the watchdog within 10s, a second watchdog interrupt fires at 60s and causes a panic.
The system then has 10s to do a dump.
At 70s, the watchdog hardware reset happens.

But if interrupts are disabled at 60s, the panic is lost.
So we need the NMI from the performance monitor to detect the hard lockup.
If the NMI period is 10s, the hard lockup is guaranteed to be detected before 70s.
But if the period changes with the cpu frequency, this is no longer guaranteed.
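
To make the arithmetic concrete, here is a rough sketch in Python (not kernel code). It assumes the counter is programmed once with a fixed cycle count taken from the nominal frequency, and it uses the same simplified model as above: one NMI period after the lockup starts at 60s is enough to detect it. All function names are made up for illustration.

```python
# Illustrative arithmetic only, not kernel code. All names are hypothetical.

def programmed_cycles(nominal_hz, thresh_s):
    """Cycle count loaded into the counter; assumed fixed at setup time."""
    return nominal_hz * thresh_s

def actual_period_s(cycles, current_hz):
    """Wall-clock seconds until the counter overflows at the current frequency."""
    return cycles / current_hz

def detected_before_reset(nmi_period_s, lockup_start_s=60, reset_at_s=70):
    """Simplified model: the NMI fires one full period after the lockup starts."""
    return lockup_start_s + nmi_period_s <= reset_at_s

# Counter programmed for 10s at 2.0GHz, then the cpu frequency drops.
cycles = programmed_cycles(2_000_000_000, 10)
for hz in (2_000_000_000, 1_000_000_000, 200_000_000):
    period = actual_period_s(cycles, hz)
    print(f"{hz / 1e6:.0f} MHz: NMI period {period:.0f}s, "
          f"detected before reset: {detected_before_reset(period)}")
```

Under these assumptions the period stays at 10s only at 2.0GHz. At 200MHz it stretches to 100s, so the hardware reset at 70s comes first and the panic is lost.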

Hope my explanation is clear.

BTW, I use intel_scu_watchdog (though it looks quite different from the upstream version), not iTCO_wdt.

Thanks
Pan Zhenjie