Re: [RFC PATCH] clocksource: skip check while watchdog hung up or unstable

From: brookxu
Date: Thu Aug 12 2021 - 20:54:51 EST




Thomas Gleixner wrote on 2021/8/12 6:53 PM:
> On Wed, Aug 11 2021 at 23:26, brookxu wrote:
>> Thomas Gleixner wrote on 2021/8/11 22:01:
>>>> To be precise, we are processing interrupts in handle_edge_irq() for a long
>>>> time. Since the interrupts of multiple hardware queues are mapped to a single
>>>> CPU, multiple cores are continuously issuing IO, and then a single core is
>>>> processing IO. Perhaps the test case can be optimized, but shouldn't this lead
>>>> to switching clocks in principle?
>>>
>>> The clocksource watchdog failure is only _ONE_ consequence. Processing
>>> hard interrupts for 155 seconds straight will trigger lockup detectors
>>> of all sorts if you have them enabled.
>>>
>>> So just papering over the clocksource watchdog does not solve anything,
>>> really. Next week you have to add similar hacks to the lockup detectors,
>>> RCU and whatever.
>>
>> Yeah, we have observed soft lockup and RCU stall, but these behaviors are
>> expected because the current CPU scheduling is disabled. However, marking
>> TSC unstable is inconsistent with the actual situation. The worst problem
>> is that after the clocksource switched to hpet, the abnormal time will be
>> greatly prolonged due to the degradation of performance. We have not found
>> that soft lockup and RCU stall will affect the machine for a long time in
>> this test. Aside from these, as the watchdog is scheduled periodically, when
>> wd_nsec is 0, it means that something may be abnormal. Do we really still
>> need to continue to verify TSC? And how do we ensure the correctness of the
>> results?
>
> Sorry no. While softlockups and RCU stalls might have no long term
> effect in the first place, this argumentation vs. the clocksource
> watchdog is just a strawman. You're abusing the system in a way which
> causes it to malfunction so you have to live with the consequences.
>
> Aside of that this 'workaround' is just duct taping a particular part of
> the problem. What guarantees that after the interrupt storm subsided the
> clocksource delta of the watchdog becomes 0 (negative)?
>
> Absolutely nothing. The delta can be positive, but then the watchdog and
> the TSC are not in sync anymore which will disable the TSC as well.
>
> A 24MHz HPET has a wraparound time of ~178s which means during:
>
> 89s < tdelta < 178s
>
> your hack papers over the problem. Any interrupt storm time outside of
> that window results in fail.
>
> Now run the same test on a machine with a 14MHz HPET and you get
>
> 153s < tdelta < 306s
>
> so your 155s interrupt storm barely fits. And what are you doing with
> your next test which runs only 80 seconds?
>
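The windows quoted above can be reproduced with a short sketch. It assumes a 32-bit watchdog counter (which is what the ~178s wraparound at 24MHz implies) and that a delta with the top bit set is treated as negative and reported as 0. The names wrap_seconds, masked_window and observed_delta are made up for illustration and are not kernel functions.

```python
# Sketch only: assumes a 32-bit watchdog counter and that a delta whose
# top bit is set is treated as negative and reported as 0.
COUNTER_BITS = 32
MASK = (1 << COUNTER_BITS) - 1


def wrap_seconds(freq_hz):
    """Seconds until a COUNTER_BITS-wide counter at freq_hz wraps around."""
    return (1 << COUNTER_BITS) / freq_hz


def masked_window(freq_hz):
    """(low, high) bounds in seconds between which the delta reads as 0."""
    wrap = wrap_seconds(freq_hz)
    return wrap / 2, wrap


def observed_delta(elapsed_s, freq_hz):
    """Counter delta the watchdog would see after elapsed_s seconds."""
    ticks = int(elapsed_s * freq_hz) & MASK  # the counter wraps silently
    # Top bit set: the delta looks negative, so it is reported as 0.
    return 0 if ticks >> (COUNTER_BITS - 1) else ticks


for freq in (24_000_000, 14_000_000):
    low, high = masked_window(freq)
    print(f"{freq // 1_000_000} MHz: {int(low)}s < tdelta < {int(high)}s")
```

Under these assumptions, observed_delta(155, 14_000_000) is 0, so a 155s storm falls inside the window on a 14MHz HPET, while observed_delta(80, 14_000_000) is a plain positive delta and a wd_nsec == 0 check would never see it. Past one full wrap, e.g. observed_delta(200, 24_000_000), the delta is positive again but no longer reflects the real elapsed time.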
> Not to talk about the fact that you wreck detection of a watchdog
> clocksource going stale.
>
> So no, we are not adding hacks to support abuse.
>
> What we really want to do is to add detection for interrupt storms of
> this sort and shut those interrupts down for good.

ok, thanks for your suggestion.

> Thanks,
>
> tglx
> ---
> Patient: "Doctor, it hurts when I hammer on my toe."
> Doctor: "Don't do that then!"