Re: [External] Re: [PATCH] Clocksource: Avoid misjudgment of clocksource

From: John Stultz
Date: Tue Oct 12 2021 - 01:06:22 EST


On Sat, Oct 9, 2021 at 2:02 AM yanghui <yanghui.def@xxxxxxxxxxxxx> wrote:
>
>
>
> On 2021/10/9 at 11:38 AM, John Stultz wrote:
> > On Fri, Oct 8, 2021 at 8:22 PM yanghui <yanghui.def@xxxxxxxxxxxxx> wrote:
> >> On 2021/10/9 at 7:45 AM, John Stultz wrote:
> >>> On Fri, Oct 8, 2021 at 1:03 AM yanghui <yanghui.def@xxxxxxxxxxxxx> wrote:
> >>>>
> >>>> clocksource_watchdog() is executed every WATCHDOG_INTERVAL (0.5s) by
> >>>> a timer. But sometimes the system is very busy and the timer cannot
> >>>> be executed within 0.5s. For example, if clocksource_watchdog() is
> >>>> executed after 10s, the calculated value of abs(cs_nsec - wd_nsec)
> >>>> will be enlarged. Then the current clocksource will be misjudged as
> >>>> unstable. So we add conditions to prevent the clocksource from
> >>>> being misjudged.
> >>>>
> >>>> Signed-off-by: yanghui <yanghui.def@xxxxxxxxxxxxx>
> >>>> ---
> >>>> kernel/time/clocksource.c | 6 +++++-
> >>>> 1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> >>>> index b8a14d2fb5ba..d535beadcbc8 100644
> >>>> --- a/kernel/time/clocksource.c
> >>>> +++ b/kernel/time/clocksource.c
> >>>> @@ -136,8 +136,10 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
> >>>>
> >>>> /*
> >>>> * Interval: 0.5sec.
> >>>> + * MaxInterval: 1s.
> >>>> */
> >>>> #define WATCHDOG_INTERVAL (HZ >> 1)
> >>>> +#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> >>>>
> >>>> static void clocksource_watchdog_work(struct work_struct *work)
> >>>> {
> >>>> @@ -404,7 +406,9 @@ static void clocksource_watchdog(struct timer_list *unused)
> >>>>
> >>>> /* Check the deviation from the watchdog clocksource. */
> >>>> md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> >>>> - if (abs(cs_nsec - wd_nsec) > md) {
> >>>> + if ((abs(cs_nsec - wd_nsec) > md) &&
> >>>> + cs_nsec < WATCHDOG_MAX_INTERVAL_NS &&
> >>>
> >>> Sorry, it's been a while since I looked at this code, but why are you
> >>> bounding the clocksource delta here?
> >>> It seems like if the clocksource being watched was very wrong (with a
> >>> delta larger than the MAX_INTERVAL_NS), we'd want to throw it out.
> >>>
> >>>> + wd_nsec < WATCHDOG_MAX_INTERVAL_NS) {
> >>>
> >>> Bounding the watchdog interval on the check does seem reasonable.
> >>> Though one may want to keep track that if we are seeing too many of
> >>> these delayed watchdog checks we provide some feedback via dmesg.
> >>
> >> Yes, checking only the watchdog delta is more reasonable.
> >> I think a dmesg message alone is not enough, because if tsc is
> >> misjudged as unstable, the system switches to hpet. And hpet is very
> >> expensive for performance, so the only way to switch back to tsc is
> >> to reboot the server. We need to prevent the clocksource switch in
> >> case of misjudgment.
> >> Circumstances of misjudgment:
> >> if clocksource_watchdog is executed after 10sec, the values of wd_delta
> >> and cs_delta will also be about 10sec, so the value of
> >> (cs_nsec - wd_nsec) will be magnified 20 times (10sec/0.5sec).
> >
> > Yea, it might be worth calculating an error rate instead of assuming
> > the interval is fixed, but also just skipping the check may be
> > reasonable assuming timers aren't constantly being delayed (and it's
> > more of a transient state).
> >
> > At some point if the watchdog timer is delayed too much, the watchdog
> I mean the execution cycle of this function (static void
> clocksource_watchdog(struct timer_list *unused)) has been delayed.
>
> > hardware will fully wrap and one can no longer properly compare
> > intervals. That's why the timer length is chosen as such, so having
> > that timer delayed is really pushing the system into a potentially bad
> > state where other subtle problems are likely to crop up.
> >
> > So I do worry these watchdog robustness fixes are papering over a
> > problem, pushing expectations closer to the edge of how far the system
> > should tolerate bad behavior. Because at some point we'll fall off. :)
>
> Sorry, I don't quite understand what you mean. Should I send you a v2
> of the patch?

Sending a v2 is usually a good step (persistence is key! :)

I'm sorry for being unclear in the above. I'm mostly just fretting
that the watchdog logic has inherent assumptions that the timers won't
be greatly delayed. Unfortunately the reality is that the timers may
be delayed. So we can try to add some robustness (as your patch does),
but at a certain point, the delays may exceed what the logic can
tolerate while still producing correct behavior. I worry that by pushing the
robustness up to that limit, folks may not recognize the problematic
behavior (greatly delayed timers - possibly caused by drivers
disabling irqs for too long, or bad SMI logic, or long virtualization
pauses), and think the system is still working as designed, even
though it's regularly exceeding the bounds of the assumptions in the
code. So without any feedback that something is wrong, those bounds
will continue to be pushed until things really break in a way we
cannot be robust about.
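
To make the robustness idea concrete, here is a rough userspace sketch
(not kernel code, and not a proposed patch) of a check that bounds only
the watchdog delta, as discussed above. The 1s bound comes from your
patch; the wd_check() helper, the wd_verdict enum, and any margin value
passed in are hypothetical names and numbers used purely for
illustration.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL
/* Bound taken from the patch under discussion (1s). */
#define WATCHDOG_MAX_INTERVAL_NS NSEC_PER_SEC

/* Hypothetical result type, for illustration only. */
enum wd_verdict { WD_OK, WD_UNSTABLE, WD_SKIPPED };

/*
 * cs_nsec: elapsed ns since the last check, per the watched clocksource
 * wd_nsec: elapsed ns since the last check, per the watchdog
 * md:      combined uncertainty margin in ns
 */
static enum wd_verdict wd_check(int64_t cs_nsec, int64_t wd_nsec, int64_t md)
{
	int64_t delta = cs_nsec - wd_nsec;

	if (delta < 0)
		delta = -delta;

	/* The timer ran late, so the margin no longer fits the interval. */
	if (wd_nsec >= WATCHDOG_MAX_INTERVAL_NS)
		return WD_SKIPPED;

	/* Interval is sane: a large delta means the clocksource is off. */
	if (delta > md)
		return WD_UNSTABLE;

	return WD_OK;
}
```

With an assumed margin of 100000ns, a 1ms delta over a 10s interval is
skipped rather than flagged, while the same 1ms delta over a normal
0.5s interval is still flagged. Note that the skip is silent, which is
exactly the concern above.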

That's why I was suggesting adding some sort of printk warning when we
do see a number of delayed timers so that folks have some signal that
things are not as they are expected to be.
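
Something along these lines is what I have in mind, again as a rough
userspace sketch rather than real kernel code. The wd_delay_stats
struct, the wd_note_interval() helper, and the threshold of 3 are all
made up for illustration; in the kernel the fprintf() would presumably
be a pr_warn() or a rate-limited variant.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NSEC_PER_SEC 1000000000LL
/* Bound taken from the patch under discussion (1s). */
#define WATCHDOG_MAX_INTERVAL_NS NSEC_PER_SEC
/* Hypothetical: warn once this many delayed checks have been seen. */
#define WD_DELAY_WARN_THRESHOLD 3

/* Hypothetical bookkeeping for delayed watchdog runs. */
struct wd_delay_stats {
	unsigned int delayed;	/* total checks that ran too late */
	bool warned;		/* warning already emitted */
};

/*
 * Record one watchdog interval. Returns true if the interval was too
 * long and the deviation check should be skipped for this round.
 */
static bool wd_note_interval(struct wd_delay_stats *st, int64_t wd_nsec)
{
	if (wd_nsec < WATCHDOG_MAX_INTERVAL_NS)
		return false;

	st->delayed++;
	if (st->delayed >= WD_DELAY_WARN_THRESHOLD && !st->warned) {
		/* Stand-in for a kernel log message. */
		fprintf(stderr,
			"clocksource watchdog: %u checks delayed beyond %lld ns\n",
			st->delayed, (long long)WATCHDOG_MAX_INTERVAL_NS);
		st->warned = true;
	}
	return true;
}
```

Warning only once keeps the log quiet on a system that is permanently
in this state, while still leaving a trace that the watchdog's timing
assumptions were violated.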

thanks
-john