Re: [PATCH] x86/tsc: RFC: re-synchronize TSCs to boot cpu TSC

From: Roland Scheidegger
Date: Fri Dec 09 2016 - 20:55:25 EST


Am 09.12.2016 um 23:59 schrieb Thomas Gleixner:
> On Fri, 9 Dec 2016, Roland Scheidegger wrote:
>
> Cc'ed someone from Dell.
>
>> Am 09.12.2016 um 18:33 schrieb Thomas Gleixner:
>>> Can you add the patch below to gather more information? There is a hunk in
>>> there with an '#if 0' which sets the TSC ADJUST to 0 on boot, which you can
>>> turn on as second step.
>>
>> Ok, here's the results:
>> ...
>> TSC ADJUST synchronize: Reference CPU0: -2820267100 CPU1: -2822498296
>> TSC target sync skipped
>> smpboot: Vector locked
>> smpboot: Vector setup done
>> smpboot: Clock setup
>> TSC source sync skipped
>> smpboot: Target CPU is online
>
> I did not expect that to happen. Now I'm puzzled and curious where the
> machine gets lost after that. See below.
>
>> With the #if 0 block activated, it boots up fine, the output was:
>
> That does not make any sense at all, but yes, nothing in this context makes
> sense.
>
>> [ 1.038892] x86: Booting SMP configuration:
>> [ 1.038930] .... node #0, CPUs: #1
>> [ 0.171851] TSC ADJUST: CPU1: -2830353064 218577682002
>> [ 1.117495] TSC source sync 0 -> 1 runs 3
>> [ 0.171852] TSC ADJUST differs: Reference CPU0: -2828600940 CPU1:
>> -2830353064
>> [ 0.171853] TSC ADJUST synchronize: Reference CPU0: 0 CPU1: -2830353064
>> [ 1.117497] TSC target sync skip
>
>> (And fwiw with my quick hack the lockups disappear too when I change it
>> back to blasting a zero into TSC_ADJ for all cpus.)
>
> Right, that's what that hunk does as well.
>
> Now what's interesting is that the adjustment of CPU1 in the non-write-to-zero
> case results in the following:
>
> TSC ADJUST: CPU1: -2830353064 218577682002 <-- TSC value
> TSC ADJUST differs: Reference CPU0: -2828600940 CPU1: -2830353064
>
> We write CPU1's adjust register to -2828600940, which makes the TSC on CPU1
> jump forward by -2828600940 - -2830353064 = 1752124 cycles.
>
> In the write-to-zero case the jump is forward as well, but this time it's
> huge, i.e. 2830353064 cycles.
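
Just to make sure I follow the arithmetic: the visible TSC is the internal
counter plus IA32_TSC_ADJUST, so rewriting the adjust register shifts the TSC
by (new adjust - old adjust). A rough sketch of my understanding (not the
actual sync code), using the values from the log above:

	s64 old_adj = -2830353064LL;	/* CPU1 value from the log */
	s64 new_adj = -2828600940LL;	/* CPU0 reference value */
	u64 before, after;

	before = rdtsc();
	wrmsrl(MSR_IA32_TSC_ADJUST, new_adj);
	after = rdtsc();
	/* after - before should be roughly new_adj - old_adj = 1752124
	 * cycles, plus whatever the few instructions themselves took */
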
>
> I tried to wreck the TSC by writing similar values to the adjust MSR on
> early boot, but independent of the values and independent of the
> write-to-zero part, the machine comes up happily all the time.
>
> The only difference is that my machine has a somewhat saner BIOS. So the
> thing might just die in the value add SMM crap, but who knows.
>
> In the patch below is another bunch of debug prints which emit the state
> information of CPU1 during bringup. Maybe that gives a hint where the
> system gets stuck when you disable the 'write to zero' magic again.
>
> The NMI watchdog does not catch anything, right?
Nope. (Though as mentioned earlier, with my hack it did when not writing
zero - but the lockup there happened later, after all 16 cpus were online,
and I only really tried that with the Ubuntu 4.4 kernel. I never got to see
the full output from that NMI due to limited screen space - my attempts to
try anything other than text mode were met with a blank screen - and from
the parts I did see I didn't notice anything interesting, though that's not
saying much as I really have no idea about that code...)

With the new patch, here's the output (though typing it all in by hand gets
a bit annoying...)
...
Invoking state 32 CB replay_prepare_cpu+0x0/0xe0
CB returned 0
Invoking state 35 CB rcutree_prepare_cpu+0x0/0x50
CB returned 0
Invoking state 41 CB notify_prepare+0x0/0xa0
CB returned 0
Invoking state 48 CB bringup_cpu+0x0/0x90
x86: Booting SMP configuration:
.... node #0, CPUs: #1
Invoking state 51 CB sched_cpu_starting+0x0/0x60
CB returned 0
Invoking state 62 CB x86_pmu_starting_cpu+0x0/0x20
CB returned 0
TSC ADJUST: CPU1: -2846131604 175264328618
TSC ADJUST differs: Reference CPU0: -2843967660 CPU1: -2846131604
TSC ADJUST synchronize: Reference CPU0: -2843967660 CPU1: -2846131604
TSC target sync skip
smpboot: Vector locked
TSC source sync 0 -> 1 runs 3
smpboot: Vector setup done
smpboot: Clock setup
TSC source sync skipped
smpboot: Target CPU is online


>
>> The system also came back up fine from suspend with this (well - still
>> minus graphics...), however disabled tsc clocksource:
>>
>> [ 579.931739] Enabling non-boot CPUs ...
>> [ 579.943107] smpboot: Booting Node 0 Processor 1 APIC 0x2
>> [ 579.943189] TSC ADJUST: CPU1: -1504429974 21601834126
>
> Fun, yet another adjust value. Are they set by a random number generator?
>
>> [ 579.944093] CPU1 is up
>
>> [ 580.458983] clocksource: timekeeping watchdog on CPU1: Marking
>> clocksource 'tsc' as unstable because the skew is too large:
>> [ 580.458985] clocksource: 'hpet' wd_now: 587c1
>> wd_last: 437c7 mask: ffffffff
>> [ 580.458986] clocksource: 'tsc' cs_now:
>> 563963cd8 cs_last: 508f5a02a mask: ffffffffffffffff
>
> Ok, that's caused by the fact that we do not sanitize the TSC adjust
> register on the boot CPU in the resume path.
Yep. (Which is why my hack touched the restore path as well - I am still
really confused about boot/resume boot cpu initialization...)
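
Roughly, on the resume side the hack does the equivalent of this in the
restore path (just a sketch of the idea, not my actual diff):

	/* sketch only: force TSC_ADJUST back to zero on the boot CPU
	 * after resume, same as the boot-time hack does */
	if (boot_cpu_has(X86_FEATURE_TSC_ADJUST))
		wrmsrl(MSR_IA32_TSC_ADJUST, 0);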

>
>> [ 581.006760] [Firmware Bug]: TSC ADJUST differs: CPU0 0 -->
>> -1502494750. Restoring
>
> We only detect it later and correct it, but that's too late. The untested
> patch below should cure that.
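
FWIW the skew numbers seem to fit that explanation: cs_now - cs_last =
0x563963cd8 - 0x508f5a02a = 1520475310 TSC cycles, while wd_now - wd_last =
0x587c1 - 0x437c7 = 86010 HPET ticks, i.e. roughly 6 ms. Assuming a ~3 GHz
TSC that interval should only be around 18 million cycles, so the ~1.5e9
cycle excess is just about the -1502494750 adjust value from the message
above.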
>
>>> This BIOS seems to be rather strange ...
>> Don't tell me...
>>
>> No idea what it's doing, but I think it's safe to say whatever it's
>> trying to do, it's doing it wrong...
>
> Amen to that. I've seen a lot of Value Add BIOSes which broke things left
> and right, but this one takes it to a new level of insanity.
>
> Did you report that back to DELL already?
No, I never did (even though the box actually shipped with Ubuntu, albeit a
really old version), since it was basically working - just without the TSC
clocksource...
I was quite happy using my hack for quite some time; it just got annoying
that I'd either have to stick to one kernel or reapply it all the
time :-).

Roland

>
> Thanks,
>
> tglx
>
> 8<--------------------
> --- a/arch/x86/include/asm/tsc.h
> +++ b/arch/x86/include/asm/tsc.h
> @@ -47,7 +47,7 @@ extern int tsc_clocksource_reliable;
> */
> #ifdef CONFIG_X86_TSC
> extern bool tsc_store_and_check_tsc_adjust(void);
> -extern void tsc_verify_tsc_adjust(void);
> +extern void tsc_verify_tsc_adjust(bool resume);
> extern void check_tsc_sync_source(int cpu);
> extern void check_tsc_sync_target(void);
> #else
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -277,7 +277,7 @@ void exit_idle(void)
>
> void arch_cpu_idle_enter(void)
> {
> - tsc_verify_tsc_adjust();
> + tsc_verify_tsc_adjust(false);
> local_touch_nmi();
> enter_idle();
> }
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1080,6 +1080,11 @@ static void detect_art(void)
>
> static struct clocksource clocksource_tsc;
>
> +static void tsc_resume(struct clocksource *cs)
> +{
> + tsc_verify_tsc_adjust(true);
> +}
> +
> /*
> * We used to compare the TSC to the cycle_last value in the clocksource
> * structure to avoid a nasty time-warp. This can be observed in a
> @@ -1112,6 +1117,7 @@ static struct clocksource clocksource_ts
> .flags = CLOCK_SOURCE_IS_CONTINUOUS |
> CLOCK_SOURCE_MUST_VERIFY,
> .archdata = { .vclock_mode = VCLOCK_TSC },
> + .resume = tsc_resume,
> };
>
> void mark_tsc_unstable(char *reason)
> --- a/arch/x86/kernel/tsc_sync.c
> +++ b/arch/x86/kernel/tsc_sync.c
> @@ -30,7 +30,7 @@ struct tsc_adjust {
>
> static DEFINE_PER_CPU(struct tsc_adjust, tsc_adjust);
>
> -void tsc_verify_tsc_adjust(void)
> +void tsc_verify_tsc_adjust(bool resume)
> {
> struct tsc_adjust *adj = this_cpu_ptr(&tsc_adjust);
> s64 curval;
> @@ -39,7 +39,7 @@ void tsc_verify_tsc_adjust(void)
> return;
>
> /* Rate limit the MSR check */
> - if (time_before(jiffies, adj->nextcheck))
> + if (!resume && time_before(jiffies, adj->nextcheck))
> return;
>
> adj->nextcheck = jiffies + HZ;
> @@ -51,7 +51,7 @@ void tsc_verify_tsc_adjust(void)
> /* Restore the original value */
> wrmsrl(MSR_IA32_TSC_ADJUST, adj->adjusted);
>
> - if (!adj->warned) {
> + if (!adj->warned || resume) {
> pr_warn(FW_BUG "TSC ADJUST differs: CPU%u %lld --> %lld. Restoring\n",
> smp_processor_id(), adj->adjusted, curval);
> adj->warned = true;
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -256,6 +256,7 @@ static void notrace __restore_processor_
> mtrr_bp_restore();
> perf_restore_debug_store();
> msr_restore_context(ctxt);
> + tsc_verify_tsc_adjust(true);
> }
>
> /* Needed by apm.c */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -133,7 +133,9 @@ static int cpuhp_invoke_callback(unsigne
> if (!cb)
> return 0;
> trace_cpuhp_enter(cpu, st->target, state, cb);
> + pr_info("Invoking state %d CB %pF\n", state, cb);
> ret = cb(cpu);
> + pr_info("CB returned %d\n", ret);
> trace_cpuhp_exit(cpu, st->state, state, ret);
> return ret;
> }
> @@ -153,7 +155,10 @@ static int cpuhp_invoke_callback(unsigne
> cnt = 0;
> hlist_for_each(node, &step->list) {
> trace_cpuhp_multi_enter(cpu, st->target, state, cbm, node);
> + pr_info("Invoking state %d CB %pF instance %p\n", state, cbm,
> + node);
> ret = cbm(cpu, node);
> + pr_info("CB returned %d\n", ret);
> trace_cpuhp_exit(cpu, st->state, state, ret);
> if (ret)
> goto err;
> @@ -550,6 +555,7 @@ static void cpuhp_thread_fun(unsigned in
>
> st->should_run = false;
>
> + pr_info("CPUHP thread running\n");
> /* Single callback invocation for [un]install ? */
> if (st->single) {
> if (st->cb_state < CPUHP_AP_ONLINE) {
>