Re: [PATCH 8/9] clocksource: Improve unstable clocksource detection

From: Thomas Gleixner
Date: Tue Sep 01 2015 - 14:56:21 EST


On Tue, 1 Sep 2015, Shaohua Li wrote:
> On Tue, Sep 01, 2015 at 07:13:40PM +0200, Thomas Gleixner wrote:
> > On Mon, 31 Aug 2015, Shaohua Li wrote:
> > > On Mon, Aug 31, 2015 at 11:47:52PM +0200, Thomas Gleixner wrote:
> > > > On Mon, 31 Aug 2015, Shaohua Li wrote:
> > > > > > The HPET wraps interval is 0xffffffff / 100000000 = 42.9s
> > > > > >
> > > > > > tsc interval is (0x481250b45b - 0x219e6efb50) / 2200000000 = 75s
> > > > > >
> > > > > > 32.1 + 42.9 = 75
> > > > > >
> > > > > > The example shows hpet wraps, while tsc is marked unstable
> > > > >
> > > > > Thomas & John,
> > > > > Is this data enough to prove TSC unstable issue can be triggered by HPET
> > > > > wrap? I can resend the patch with the data included.
> > > >
> > > > Well, it's enough data to prove:
> > > >
> > > > - that keeping a VM off the CPU for 75 seconds is insane.
> > >
> > > It wraps in 42.9s. 42.9s isn't a long time hard to block. I don't think
> >
> > You think that blocking softirq execution for 42.9 seconds is normal?
> > Seems we are living in a different universe.
>
> I don't say it's normal. I say it's not hard to trigger.

So because it's not hard to trigger, we cure the symptom and do not
think about the insanity of blocking the watchdog for 42+ or 300+
seconds.

> > > it's just VM off. A softirq can hog the cpu.
> >
> > I still want to see proof of that. There is just handwaving about
> > that, but nobody has provided proper data to back that up.
>
> I showed you the TSC runs 75s, while hpet wraps. What info do you think
> can prove this?

You prove nothing. You showed me the symptom, but you never showed
real data that a softirq hogs the cpu for 300+ seconds. Still you keep
claiming that.

Nor did you provide a proper explanation WHY your VM test blocked
the watchdog for 75 seconds. No, you merely showed me the numbers. And
just because the numbers explain the symptom, that's no justification
WHY we should cure the symptom instead of looking at the root cause.
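
For reference, the arithmetic behind the quoted numbers can be sketched as
below. This is only an illustration of the symptom, not of the root cause;
it assumes both clocks tick at exactly their nominal rates, and all helper
names are made up for this example.

```python
# Values quoted earlier in this thread; helper names are illustrative only.
HPET_HZ = 100_000_000      # emulated HPET frequency in the VM
TSC_HZ = 2_200_000_000     # TSC frequency from the quoted calculation
HPET_MASK = 0xFFFFFFFF     # the quoted wrap calculation uses a 32-bit counter

TSC_START = 0x219E6EFB50   # TSC readings from the quoted example
TSC_END = 0x481250B45B


def hpet_wrap_seconds():
    """Time for the 32-bit HPET counter to wrap once."""
    return (HPET_MASK + 1) / HPET_HZ


def tsc_elapsed_seconds():
    """Interval measured by the TSC between the two readings."""
    return (TSC_END - TSC_START) / TSC_HZ


def hpet_observed_seconds():
    """Interval the HPET appears to report once its delta is masked to 32 bits."""
    # Assumption: the HPET ticked at its nominal rate for the whole interval.
    real_ticks = (TSC_END - TSC_START) * HPET_HZ // TSC_HZ
    return (real_ticks & HPET_MASK) / HPET_HZ


if __name__ == "__main__":
    print(f"HPET wrap interval: {hpet_wrap_seconds():.1f}s")
    print(f"TSC interval:       {tsc_elapsed_seconds():.1f}s")
    print(f"HPET as observed:   {hpet_observed_seconds():.1f}s")
```

The difference between the TSC interval and the observed HPET interval is
one HPET wrap, which is why the TSC gets blamed.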

> > > > - that emulating the HPET with 100MHz shortens the HPET wraparound by
> > > > a factor of 7 compared to real hardware. With a realistic HPET
> > > > frequency you have about 300 seconds.
> > > >
> > > > Who thought that using 100MHz HPET frequency is a brilliant idea?
> > >
> > > I'm not a VM expert. My guess is the 100MHz can reduce interrupts. It's
> > > insane for the hypervisor to update the HPET count at 14.3MHz. Switching
> > > to HPET can introduce even higher overhead in a VM, because of the
> > > vmexit on iomemory access
> >
> > Sorry, that does not make any sense at all.
> >
> > - How does 100MHz HPET frequency reduce interrupts?
> >
> > - What's insane about a lower emulated HPET frequency?
> >
> > - We all know that switching to HPET is more expensive than just
> > using TSC. That's not the question at all and completely
> > unrelated to the 100MHz HPET emulation frequency.
>
> It's meaningless to argue about HPET frequency. The code should not
> work just for a 14.3MHz HPET.

You carefully avoid answering any of my questions, but you expect
me to accept your wild guesses as arguments?

> > I'm not pretending anything. I'm merely refusing to accept that change
> > w/o a proper explanation WHY the watchdog fails on physical hardware,
> > i.e. WHY it does not run for more than 300 seconds.
>
> It's meaningless to argue about virtual/physical machine too. Linux
> works for both virtual/physical machines.

That has nothing to do with virt vs. physical. Virtualization is meant
to provide proper hardware emulation. Does Linux work with a buggy
APIC emulation? Not at all, but you expect that it just handles an
insane HPET emulation, right?

> What about the acpi_pm clocksource then? It wraps in about 5s. It's sane
> that HPET is disabled and acpi_pm is used as the watchdog. Do you still
> think 5s is long?

Yes, five seconds is long. That's about 10 billion CPU cycles on a
2GHz machine. If your desktop stalls for 5 seconds you are probably
less enthusiastic.
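
To put the wrap intervals discussed in this thread side by side, here is a
small sketch. The 14.31818MHz HPET rate and the 24-bit, 3.579545MHz ACPI PM
timer are nominal values assumed for illustration (the thread itself only
says "14.3MHz" and "about 5s"), and the function names are hypothetical.

```python
# Nominal counter parameters; the HPET and ACPI PM figures are assumptions
# for illustration, not values taken from a specific machine in this thread.
HPET_TYPICAL_HZ = 14_318_180    # typical HPET rate on real hardware
HPET_EMULATED_HZ = 100_000_000  # emulated HPET rate discussed in this thread
ACPI_PM_HZ = 3_579_545          # ACPI PM timer rate
CPU_HZ = 2_000_000_000          # the 2GHz machine from the example above


def wrap_seconds(bits, hz):
    """Seconds until a free-running counter of the given width wraps."""
    return (1 << bits) / hz


def cycles_in(seconds, cpu_hz):
    """CPU cycles that elapse during a stall of the given length."""
    return seconds * cpu_hz


if __name__ == "__main__":
    print(f"HPET, 14.318MHz, 32 bit: {wrap_seconds(32, HPET_TYPICAL_HZ):.0f}s")
    print(f"HPET, 100MHz, 32 bit:    {wrap_seconds(32, HPET_EMULATED_HZ):.1f}s")
    print(f"acpi_pm, 24 bit:         {wrap_seconds(24, ACPI_PM_HZ):.1f}s")
    print(f"cycles in a 5s stall:    {cycles_in(5, CPU_HZ)}")
```

Even the shortest of these intervals is a stall of billions of cycles
before the watchdog would ever see a wrap.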

Again, I'm not against making the watchdog more robust, but I'm
against curing the symptoms. As long as you refuse to provide proper
explanations WHY the watchdog is blocked unduly long, this is going
nowhere.

Thanks,

tglx