Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm

From: Thomas Gleixner
Date: Wed Dec 14 2016 - 03:14:59 EST


On Wed, 14 Dec 2016, Roland Scheidegger wrote:
> On 13.12.2016 at 17:46, Thomas Gleixner wrote:
> > What are the adjust values after a warm boot?
>
> So, after a cold boot with a kernel which doesn't adjust the TSCs,
> followed by a warm boot, I got:
> [ 0.000000] TSC ADJUST: CPU0: -602358264300 176072418728
> [ 0.000000] TSC ADJUST: Boot CPU0: -602358264300
> [ 0.172245] TSC ADJUST: CPU1: -602360207584 176587932558
> [ 0.172245] TSC ADJUST differs: Reference CPU0: -602358264300 CPU1: -602360207584
> [ 0.172246] TSC ADJUST synchronize: Reference CPU0: -602358264300 CPU1: -602360207584
> [ 0.252663] TSC ADJUST: CPU2: -602359000822 176828627154
> [ 0.252663] TSC ADJUST differs: Reference CPU0: -602358264300 CPU2: -602359000822
> [ 0.252664] TSC ADJUST synchronize: Reference CPU0: -602358264300 CPU2: -602359000822
> [ 0.337014] TSC ADJUST: CPU3: -602360177680 177081093132
> [ 0.337014] TSC ADJUST differs: Reference CPU0: -602358264300 CPU3: -602360177680
> [ 0.337015] TSC ADJUST synchronize: Reference CPU0: -602358264300 CPU3: -602360177680
>
> and so on.
>
> However, after another reboot (some minutes later), it locked up
> again right away:
>
> TSC ADJUST: CPU1: -8257481427958 165112676430
> TSC ADJUST differs: Reference CPU0: -8257479484330 CPU1: -8257481427958
> TSC ADJUST synchronize: Reference CPU0: -8257479484330 CPU1: -8254781427958
> TSC target sync skip
> ...
> smpboot: Target CPU is online
>
> So, I actually thought the TSC would get reset on warm boot too, but
> that clearly isn't the case.
> But I don't know what the difference between the first and the second
> reboot is: the adjust values just have a larger magnitude, but otherwise
> the direction of the adjustments and everything else looks the same
> (just like cold boot, which also looks the same to me).

I haven't found a pattern for the lockups yet and we have to wait for Intel
to provide useful information about that issue. All we know so far is that
negative adjust values are dangerous.
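
To make the numbers in the log above easier to compare, here is a minimal
sketch of the arithmetic behind the "differs" lines. It is only an
illustration: the helper names tsc_adjust_delta() and
tsc_adjust_is_negative() are hypothetical and are not taken from the
patches or from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: offset of a CPU's TSC_ADJUST
 * value from the reference CPU's value, as printed in the
 * "TSC ADJUST differs" lines.
 */
int64_t tsc_adjust_delta(int64_t ref, int64_t cur)
{
	return cur - ref;
}

/*
 * Hypothetical helper, not kernel code: flags the one case known so far
 * to be problematic in this thread, a negative adjust value.
 */
bool tsc_adjust_is_negative(int64_t adj)
{
	return adj < 0;
}
```

With the values from the first warm boot, CPU1 sits 1943284 below CPU0,
CPU2 736522 below and CPU3 1913380 below, and every reported value is
negative.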

Could you test the two patches on top of tip x86/timers branch so we can
make progress with that whole disaster while waiting for Intel to come
forth with a proper explanation?

Thanks,

tglx