Re: [RFC PATCH v2] tick: Make tick_periodic() check for missing ticks
From: Waiman Long
Date: Sun Mar 15 2020 - 22:43:28 EST
On 3/15/20 10:20 PM, Guenter Roeck wrote:
> Hi,
>
> On Fri, Feb 07, 2020 at 02:39:29PM -0500, Waiman Long wrote:
>> The tick_periodic() function is used at the beginning part of the
>> bootup process for time keeping while the other clock sources are
>> being initialized.
>>
>> The current code assumes that all the timer interrupts are handled in
>> a timely manner with no missing ticks. That is not actually true. Some
>> ticks are missed and there are some discrepancies between the tick time
>> (jiffies) and the timestamp reported in the kernel log. Some systems,
>> however, are more prone to missing ticks than the others. In the extreme
>> case, the discrepancy can actually cause a soft lockup message to be
>> printed by the watchdog kthread. For example, on a Cavium ThunderX2
>> Sabre arm64 system:
>>
>> [ 25.496379] watchdog: BUG: soft lockup - CPU#14 stuck for 22s!
>>
>> On that system, the missing ticks are especially prevalent during the
>> smp_init() phase of the boot process. With an instrumented kernel,
>> it was found that it took about 24s as reported by the timestamp for
>> the tick to accumulate 4s of time.
>>
>> Investigation and bisection done by others seemed to point to the
>> commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or
>> lack thereof") as the culprit. It could also be a firmware issue as
>> new firmware was promised that would fix the issue.
>>
>> To properly address this problem, we cannot assume that there will
>> be no missing tick in tick_periodic(). This function is now modified
>> to follow the example of tick_do_update_jiffies64() by using another
>> reference clock to check for missing ticks. Since the watchdog timer
>> uses running_clock(), it is used here as the reference. With this patch
>> applied, the soft lockup problem in the arm64 system is gone and tick
>> time tracks much more closely to the timestamp time.
>>
>> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
> Since this patch is in linux-next, roughly 10% of my x86 and x86_64
> qemu emulation boots are stalling. Typical log:
>
> [ 0.002016] smpboot: Total of 1 processors activated (7576.40 BogoMIPS)
> [ 0.002016] devtmpfs: initialized
> [ 0.002016] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
> [ 0.002016] futex hash table entries: 256 (order: 3, 32768 bytes, linear)
> [ 0.002016] xor: measuring software checksum speed
>
> another:
>
> [ 0.002653] Freeing SMP alternatives memory: 44K
> [ 0.002653] smpboot: CPU0: Intel Westmere E56xx/L56xx/X56xx (IBRS update) (family: 0x6, model: 0x2c, stepping: 0x1)
> [ 0.002653] Performance Events: unsupported p6 CPU model 44 no PMU driver, software events only.
> [ 0.002653] rcu: Hierarchical SRCU implementation.
> [ 0.002653] smp: Bringing up secondary CPUs ...
> [ 0.002653] x86: Booting SMP configuration:
> [ 0.002653] .... node #0, CPUs: #1
> [ 0.000000] smpboot: CPU 1 Converting physical 0 to logical die 1
>
> ... and then there is silence until the test aborts.
>
> This is only (or at least predominantly) seen if the system running
> the emulation is under load.
>
> Reverting this patch fixes the problem.
I was aware that there are some problems with this patch, but they are
hard to reproduce. Do you have a more consistent way to reproduce the stall?
When you say "under load", do you mean that the host system is also busy,
so that there is a lot of vCPU preemption? Could you give me the
x86-64 .config file that you use?
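
To make the discussion concrete, the catch-up idea described in the commit
message can be modeled roughly as below. This is a simplified user-space
sketch under stated assumptions, not the kernel code: struct tick_model,
tick_model_update() and their fields are hypothetical names, and the plain
nanosecond argument stands in for the running_clock() reference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of tick accounting; not kernel code. */
struct tick_model {
	uint64_t jiffies;        /* ticks accounted so far */
	uint64_t last_update_ns; /* reference time of the last accounted tick */
	uint64_t period_ns;      /* nominal length of one tick */
	int      started;        /* set once the first interrupt is seen */
};

/*
 * Call once per delivered timer interrupt, passing the reference clock
 * reading (standing in for running_clock()). Returns the ticks credited.
 */
static uint64_t tick_model_update(struct tick_model *t, uint64_t now_ns)
{
	uint64_t ticks = 1;	/* a delivered interrupt counts as one tick */

	assert(t->period_ns > 0);

	if (!t->started) {
		/* First interrupt: nothing to compare against yet. */
		t->started = 1;
		t->last_update_ns = now_ns;
	} else {
		/* Treat a reference reading behind last_update_ns as no elapsed time. */
		uint64_t delta = now_ns > t->last_update_ns ?
				 now_ns - t->last_update_ns : 0;

		/* Two or more periods elapsed: some ticks were missed. */
		if (delta >= 2 * t->period_ns)
			ticks = delta / t->period_ns;
		t->last_update_ns += ticks * t->period_ns;
	}

	t->jiffies += ticks;
	return ticks;
}
```

In this model, with period_ns = 1000, interrupts seen at 5000, 6000, 11500
and 12000 ns are credited 1, 1, 5 and 1 ticks respectively, so a long gap
between delivered interrupts turns into a multi-tick jump.
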
Thanks,
Longman