[HELP] CPU Hard LOCKUP during boot up with HPET clock source

From: Pintu Kumar
Date: Fri Apr 06 2018 - 11:07:33 EST


First, a few details:
Kernel: 4.9.20
Machine: x86_64 (AMD)
Model: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Cores: 8
Available clock source:
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
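
For reference, the same sysfs files can be read from a small userspace
program once the system is up. This is only a sketch: the helper names
read_first_line() and print_clocksources() are made up for illustration,
and the only real interface assumed is the sysfs directory shown above
(plus its sibling file current_clocksource).

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of `path` into `buf` (at most len - 1 bytes) and
 * strip the trailing newline. Returns 0 on success, -1 on any error.
 * Hypothetical helper, not part of any kernel or libc API.
 */
int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (len == 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Print the current and the available clock sources; 0 if both were read. */
int print_clocksources(void)
{
    static const char dir[] = "/sys/devices/system/clocksource/clocksource0";
    char path[160], line[256];
    int rc = 0;

    snprintf(path, sizeof(path), "%s/current_clocksource", dir);
    if (read_first_line(path, line, sizeof(line)) == 0)
        printf("current:   %s\n", line);
    else
        rc = -1;

    snprintf(path, sizeof(path), "%s/available_clocksource", dir);
    if (read_first_line(path, line, sizeof(line)) == 0)
        printf("available: %s\n", line);
    else
        rc = -1;

    return rc;
}
```

Calling print_clocksources() after boot shows which source the kernel
finally settled on; it says nothing about which source was active while
the init loop ran.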

[ 28.027409] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
Modules linked in:
[ 28.136317] RIP: 0010:[<ffffffff98058c43>] [<ffffffff98058c43>]

This lockup happens during boot, when the CPU is stuck for about 28 seconds.
It is caused by our internal code changes.
During our init function we run some calibration loops
10,000,000 (10MHz) times, twice.
The LOCKUP is triggered by this loop.

But, we observed that the main issue is the clock source that is
available at that time.
At the time this loop is executed, the available clock source is HPET (not TSC).
With HPET the loop runs slower. It takes almost 28 seconds to complete
with the HPET clock source, so the boot time also increases by 28 seconds.
Whereas with TSC the loop completes in less than 4 seconds. So, with
TSC we don't get the LOCKUP.
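
To put rough numbers on this: assuming the two loops together run
2 x 10,000,000 = 20,000,000 iterations and that essentially all of the
time is spent inside them, 28 seconds works out to about 1400 ns per
iteration with HPET, against less than 200 ns per iteration with TSC.
The sketch below does that arithmetic and also measures the cost of one
clock read from userspace. It assumes the loop body reads the clock once
per iteration, which is a guess about our loop and not something measured.
The function names are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Average cost of one loop iteration, in nanoseconds. */
double ns_per_iteration(double total_seconds, long iterations)
{
    return total_seconds * 1e9 / (double)iterations;
}

/*
 * Average cost in nanoseconds of one clock_gettime(CLOCK_MONOTONIC) call,
 * measured over `reads` back-to-back calls. The result depends on the
 * clock source that is current when this runs.
 */
double measure_clock_read_ns(long reads)
{
    struct timespec start, end, tmp;
    double elapsed;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < reads; i++)
        clock_gettime(CLOCK_MONOTONIC, &tmp);
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed = (double)(end.tv_sec - start.tv_sec) +
              (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return ns_per_iteration(elapsed, reads);
}
```

Running measure_clock_read_ns() once with tsc and once with hpet as the
current clock source would give a comparable per-read figure on the
running system.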

Thus, the lockup is happening only because the loop executes with HPET
clock source.

To fix the problem, I tried the following approach:
1) Use late_initcall for our driver init to delay the call until TSC
clock source is ready.
=> With this there is no LOCKUP trace and no impact on boot time.
This is because the loop executes with TSC.

2) We have 2 loops. So I split the local_irq_save/restore part so that
each loop has its own.
=> With this also there is no backtrace seen.
=> But boot time is increased.

3) I used delayed_workqueue to delay the execution of the loop by 5
seconds, until TSC is ready.
=> With this there is no backtrace and also boot time is normal.
=> But if we disable TSC then we still get the backtrace.

4) Disabled HPET from the kernel command line using: hpet=disable
=> This also works as the loop executes with the next available
clock source: acpi_pm
=> But changing boot args is not recommended in our case.

5) Disable HPET related configs in kernel
=> This method does not work as we were not able to disable
HPET_TIMER on x86_64.

6) Use hpet_disable() from our code.
=> This method also does not work. It actually does not disable
HPET clock source.
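
The idea behind approach 2 can be modelled in plain userspace C: run the
long loop in bounded chunks and do something between chunks. In the
kernel, the chunk boundaries would be where local_irq_save() and
local_irq_restore() sit, and the hook between chunks is where one could
call touch_nmi_watchdog() if that is acceptable. Everything in the sketch
below (run_in_chunks, chunk_stats, the two counters) is a hypothetical
name for illustration and not our driver code.

```c
#include <stddef.h>

/* Bookkeeping used only to show what the chunked loop did. */
struct chunk_stats {
    unsigned long iterations;  /* loop bodies executed */
    unsigned long breaks;      /* times the between-chunk hook ran */
};

/* Example loop body: counts one iteration. */
void count_iteration(void *ctx)
{
    ((struct chunk_stats *)ctx)->iterations++;
}

/* Example between-chunk hook: counts one break. */
void count_break(void *ctx)
{
    ((struct chunk_stats *)ctx)->breaks++;
}

/*
 * Run `total` iterations of `body` in chunks of at most `chunk`
 * iterations, calling `between` after every chunk.
 * Returns the number of chunks executed.
 */
unsigned long run_in_chunks(unsigned long total, unsigned long chunk,
                            void (*body)(void *), void (*between)(void *),
                            void *ctx)
{
    unsigned long done = 0, chunks = 0;

    if (chunk == 0)
        chunk = 1;  /* avoid an endless loop on a zero chunk size */

    while (done < total) {
        unsigned long left = total - done;
        unsigned long n = left < chunk ? left : chunk;
        unsigned long i;

        /* kernel analogue: local_irq_save(flags) would go here */
        for (i = 0; i < n; i++)
            body(ctx);
        /* kernel analogue: local_irq_restore(flags) would go here */

        done += n;
        chunks++;
        if (between)
            between(ctx);
    }
    return chunks;
}
```

For example, run_in_chunks(10, 3, count_iteration, count_break, &stats)
runs chunks of 3, 3, 3 and 1 iterations. Smaller chunks keep each
interrupts-off window short, but they do not make the loop itself any
faster, which matches the longer boot time seen with approach 2.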

We would therefore like your opinion on which of these is the right
solution to fix this lockup during boot.

Is there a way to purposefully fall back to the next available clock source
(acpi_pm) instead of hpet, from the source code, before executing our
loop?
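
To be clear about what "next available" means here: given the list
printed by available_clocksource, it is the entry that follows hpet. The
sketch below only does that string lookup in userspace. The function name
next_clocksource() is made up. From userspace, root can write the
resulting name into current_clocksource in the same sysfs directory, but
that is only possible after boot, so it does not solve the problem for
our init-time loop.

```c
#include <string.h>

/*
 * Find the entry that follows `avoid` in the space-separated list
 * `available` and copy it into `out` (size `len`).
 * Returns 0 on success, -1 if `avoid` is missing, is the last entry,
 * or `out` is too small.
 */
int next_clocksource(const char *available, const char *avoid,
                     char *out, size_t len)
{
    const char *p = available;
    size_t avoid_len = strlen(avoid);
    int found = 0;

    while (*p) {
        size_t n;

        p += strspn(p, " \t\n");   /* skip separators */
        n = strcspn(p, " \t\n");   /* length of this entry */
        if (n == 0)
            break;

        if (found) {
            if (n >= len)
                return -1;
            memcpy(out, p, n);
            out[n] = '\0';
            return 0;
        }
        if (n == avoid_len && strncmp(p, avoid, n) == 0)
            found = 1;
        p += n;
    }
    return -1;
}
```

With the list from this machine, next_clocksource("tsc hpet acpi_pm",
"hpet", buf, sizeof(buf)) fills buf with "acpi_pm".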

Please let me know if there are alternate options.