Re: [REGRESSION] ? system is stuck in clocksource, >60s delay at boot time without tsc=unstable

From: Fab Stz
Date: Fri Jan 03 2025 - 11:40:29 EST


Hello John,

On 02/01/2025 at 22:56, John Stultz wrote:
On Thu, Jan 2, 2025 at 1:49 PM John Stultz <jstultz@xxxxxxxxxx> wrote:

On Fri, Dec 27, 2024 at 4:39 AM Fab Stz <fabstz-it@xxxxxxxx> wrote:

Hello,

It has been a month since I sent this email. Do you have any idea what is going on?

Apologies you didn't get a quick response, but you didn't really cc
many people on the first one.

No problem. I thought it better not to CC too many people on the first message, given that it was also sent to the mailing list.

On Wednesday, 27 November 2024, 08:18:41 CET, Fab Stz wrote:
Hi,

While upgrading from Debian bullseye (kernel 5.10) to bookworm (6.1), I
noticed that the newer kernel gets stuck for more than 60 seconds early
in the boot.

This is apparently related to the clocksource module. If I boot with
tsc=unstable there is no more delay.

In the kernel logs, I have:

clocksource: Long readout interval, skipping watchdog check: cs_nsec: 512010551 wd_nsec: 39243763320
clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource: 'hpet' wd_nsec: 537773520 wd_now: 3f0f7632 wd_last: 3e425140 mask: ffffffff
clocksource: 'tsc' cs_nsec: 511996079 cs_now: 18b0866e6a cs_last: 185f8d68ba mask: ffffffffffffffff
clocksource: 'tsc' is current clocksource.
tsc: Marking TSC unstable due to clocksource watchdog
TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
sched_clock: Marking unstable (3765559657, 1276001)<-(3775071370, -8235646)
clocksource: Checking clocksource tsc synchronization from CPU 1 to CPUs 0.
clocksource: Switched to clocksource hpet
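
For reference, the skew the watchdog complains about can be recomputed from the numbers in the log above. This is only a rough sketch: the values are copied from the log, and the variable names are illustrative, not the kernel's.

```python
# Values copied from the watchdog log above; names are illustrative only.
wd_nsec = 537773520   # interval as measured by the watchdog ('hpet')
cs_nsec = 511996079   # same interval as measured by 'tsc'

# Positive skew means 'tsc' counted less time than 'hpet' over the interval.
skew_nsec = wd_nsec - cs_nsec
print(f"tsc is behind hpet by {skew_nsec} ns ({skew_nsec / 1e6:.1f} ms)")

# Raw counter deltas, masked with the 'mask' values shown in the log.
wd_delta = (0x3f0f7632 - 0x3e425140) & 0xffffffff
cs_delta = (0x18b0866e6a - 0x185f8d68ba) & 0xffffffffffffffff
print(f"hpet ticks: {wd_delta}, tsc cycles: {cs_delta}")
```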


I already had this warning with 5.10, but without the >60 s freeze
that 6.1 shows.

So, it sounds like your TSC stalls in idle (likely missing
X86_FEATURE_NONSTOP_TSC), and probably something between 5.10 and 6.1
added a sleep which causes the stall before the clocksource watchdog
can check and disable the TSC on its own.
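
A quick way to check from user space whether the CPU advertises the feature mentioned above is to look at the flags in /proc/cpuinfo. The following is a minimal sketch, assuming an x86 Linux system where the flags appear as `constant_tsc` and `nonstop_tsc`; the helper name `tsc_flags` is made up for this example.

```python
def tsc_flags(cpuinfo_text):
    """Report which TSC-related flags appear in /proc/cpuinfo-style text."""
    wanted = ("constant_tsc", "nonstop_tsc")
    for line in cpuinfo_text.splitlines():
        # Only the first "flags : ..." line is inspected (first CPU).
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return {name: name in flags for name in wanted}
    return {name: False for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(tsc_flags(f.read()))
    except OSError:
        print("no /proc/cpuinfo on this system")
```

If `nonstop_tsc` is reported as False, that would be consistent with a TSC that stalls in idle, though it does not prove it.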

The kernel is telling you tsc=unstable is the way to go here, and it
seems that is working for you. From my first glance, I'd not call
this a regression, as the kernel was warning you about the problematic
hardware before, and it was most likely just luck that it was able to
auto-detect the problem before there were any negative results.

Debian even suggests this for the iMac9,1 hardware you're using:
https://wiki.debian.org/InstallingDebianOn/Apple/iMac/9-1#Boot_on_installer

And highlights the exact behavior you describe (maybe this is your efforts?):
https://wiki.debian.org/InstallingDebianOn/Apple/iMac/9-1#Kernel_configuration


I'm the author of that page on the debian wiki, indeed.


My findings are as follows:

* No delay with the following kernel versions shipped by Debian (when run on an up-to-date bookworm as of today):
5.10.226, 5.19.11, 6.0.10, 6.1.4, 6.1.27, 6.1.38, 6.1.66, 6.1.76, 6.1.82

* Delay with the following kernel versions:
5.15.15, 6.1.85, 6.1.119

So something probably changed between 6.1.82 and 6.1.85 (Debian doesn't ship packages for the versions in between). Why 5.15.15 also has a delay is not clear.
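
To make the narrowing-down explicit, here is a small sketch that derives the last good and first bad version within one stable series from the two lists above. The function names `parse` and `boundary` are only illustrative.

```python
def parse(version):
    """Turn '6.1.85' into (6, 1, 85) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Version lists copied from the findings above.
no_delay = ["5.10.226", "5.19.11", "6.0.10", "6.1.4", "6.1.27",
            "6.1.38", "6.1.66", "6.1.76", "6.1.82"]
delay = ["5.15.15", "6.1.85", "6.1.119"]


def boundary(series, good, bad):
    """Return (last good, first bad) within one series such as (6, 1)."""
    g = sorted(v for v in map(parse, good) if v[:2] == series)
    b = sorted(v for v in map(parse, bad) if v[:2] == series)
    if not g or not b:
        return None  # the series was not tested on both sides
    first_bad = b[0]
    older_good = [v for v in g if v < first_bad]
    return (older_good[-1] if older_good else None, first_bad)


print(boundary((6, 1), no_delay, delay))
```

For the 5.15 series there is no tested version without the delay, so the function returns None there and no range can be derived.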

For the versions where there is a delay, the clocksource warning about an unstable clock always comes after the first line that mentions USB ("ACPI: bus type USB registered").

For the versions which don't have a boot delay, the clocksource warning about an unstable clock always comes before the first line that mentions USB ("ACPI: bus type USB registered").

However, with 6.1.82, the unstable clocksource message sometimes comes after the USB line. When this happens, both messages are very close in time (less than 50 ms?), so the subsequent USB messages always appear after the clocksource message. So the clocksource watchdog may finish early enough to avoid the lock-up.

Actually, the lock-up usually occurs a bit later than "ACPI: bus type USB registered", and the message logged at the time of the lock-up is related to USB.

Moreover, whether there is a boot delay or not:

- the line "ACPI: bus type USB registered" always comes after "Run /init as init process"

- the warning from clocksource mentioning an unstable clock may or may not be after "Run /init as init process"
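
The ordering of the two messages can be checked mechanically on a saved boot log. The sketch below assumes dmesg-style lines with a leading "[seconds.micros]" timestamp; the function names are made up for this example, and the two search strings are taken from the messages quoted in this thread.

```python
import re

# Matches lines such as "[    3.765559] some message".
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def first_time(log_text, needle):
    """Timestamp in seconds of the first line containing needle, or None."""
    for line in log_text.splitlines():
        m = STAMP.match(line)
        if m and needle in m.group(2):
            return float(m.group(1))
    return None


def watchdog_before_usb(log_text):
    """True if the tsc 'unstable' warning precedes USB registration.

    Returns None when either message is missing from the log.
    """
    wd = first_time(log_text, "Marking clocksource 'tsc' as unstable")
    usb = first_time(log_text, "ACPI: bus type USB registered")
    if wd is None or usb is None:
        return None
    return wd < usb
```

Feeding it the text of a saved dmesg output from each kernel version would show whether the "warning before USB" pattern holds for every boot.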

Could it be that USB should not be registered/loaded before it has been determined whether the clocksource is unstable?

Regards
Fab