Re: [PATCH RFC] time: validate watchdog clocksource using second best candidate

From: Konstantin Khlebnikov
Date: Wed Oct 23 2019 - 06:18:42 EST


On 20/05/2019 18.04, Konstantin Khlebnikov wrote:
On 18.05.2019 21:26, Thomas Gleixner wrote:
On Sat, 18 May 2019, Konstantin Khlebnikov wrote:

On 18.05.2019 18:17, Thomas Gleixner wrote:
On Wed, 15 May 2019, Konstantin Khlebnikov wrote:

Timekeeping watchdog verifies doubtful clocksources using more reliable
candidates. For x86 it likely verifies 'tsc' using 'hpet'. But 'hpet'
is far from perfect too. It's better to have second opinion if possible.

We're seeing sudden jumps of hpet counter to 0xffffffff:

On which kind of hardware? A particular type of CPU or random ones?

In general this is very rare event.

This exact pattern have been seen ten times or so on several servers with
Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
(this custom built platform with chipset Intel C610)

and haven't seen for previous generation
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
(this is another custom built platform)

Same chipset? Note the HPET is part of the chipset not of the CPU.

Almost the same. Intel C600.


Link was in patch: https://lore.kernel.org/patchwork/patch/667413/

Hmm. Not really helpful either.

This patch uses second reliable clocksource as backup for validation.
For x86 this is usually 'acpi_pm'. If watchdog and backup are not consent
then other clocksources will not be marked as unstable at this iteration.

The mess you add to the watchdog code is unholy and that's broken as there
is no guarantee for acpi_pm (or any other secondary watchdog) being
available.

ACPI power management timer is a pretty standard x86 hardware.

Used to be.

But my patch should work for any platform with any second reliable
clocksource.

Which is close to zero if PM timer is not exposed.

If there is no second clocksource my patch does noting:
watchdog_backup stays NULL and backup_consent always true.

That still does not justify the extra complexity for a few custom built
systems.

>
> Aside of that this leaves the HPET in a half broken state. HPET is not only
> used as a clock event device it's also exposed by HPET device. So no, if we
> figure out that HPET is broken on some platforms we have to blacklist and
> disable it completely and not just duct tape the place which exposes the
> wreckage.
>

If re-reading helps then HPET is fine.
This is temporary failure, probably bus issue.


I'll add re-reading with debug logging and try to collect more information this year.

Good news, everyone! We've found conditions when HPET counter returns '-1'.

clocksource: timekeeping watchdog on CPU21: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'hpet' wd_now: ffffffff wd_last: 40bc1ee3 mask: ffffffff
clocksource: 'tsc' cs_now: 919b39935acdaa cs_last: 919b3957c0ec24 mask: ffffffffffffffff
clocksource: Switched to clocksource hpet

This happens when user-space does inappropriate access to directly mapped HPET.
For example dumping whole "vvar" area: memcpy(buf, addr("vvar"), 0x3000).

In our case sequence was triggered by bug in "atop" crashed at "\n" in task comm. =)

In upstream everything is fine. Direct access to HPET was sealed in 4.7 (we seen bug in 4.4)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1ed95e52d902035e39a715ff3a314a893a96e5b7

Kudos to Arseny Smalyuk for (accidental) reproducing and investigation.

---

#include <stdio.h>
#include <string.h>

char tmp[0x3000];

int main() {
void* vvar = NULL;
FILE* fp = fopen("/proc/self/maps", "r");

char buf[4096];
while (fgets(buf, 4096, fp)) {
size_t slen = strlen(buf);
if (slen > 0 && buf[slen - 1] == '\n') {
--slen;
}
if (slen > 6 && !memcmp(buf + slen - 6, "[vvar]", 6)) {
sscanf(buf, "%p", &vvar);
break;
}
}

fclose(fp);

memcpy(tmp, vvar, 0x3000);
return 0;
}