Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

From: Thomas Gleixner
Date: Sun Apr 03 2022 - 06:03:38 EST


On Mon, Mar 14 2022 at 15:46, Waiman Long wrote:
> As stated in the comment of check_tsc_sync_target():
>
> The adjustment value is slightly off by the overhead of the
> sync mechanism (observed values are ~200 TSC cycles), but this
> really depends on CPU, node distance and frequency. So
> compensating for this is hard to get right.
>
> That overhead, however, can cause the tsc adjustment to fail after 3
> test runs as can be seen when booting up a hot 4-socket Intel CooperLake
> system:
>
> [ 0.034090] TSC deadline timer available
> [ 0.008807] TSC ADJUST compensate: CPU36 observed 95626 warp. Adjust: 95626
> [ 0.008807] TSC ADJUST compensate: CPU36 observed 74 warp. Adjust: 95700
> [ 0.974281] TSC synchronization [CPU#0 -> CPU#36]:
> [ 0.974281] Measured 4 cycles TSC warp between CPUs, turning off TSC clock.
> [ 0.974281] tsc: Marking TSC unstable due to check_tsc_sync_source failed
>
> To prevent this tsc adjustment failure, we need to estimate the sync
> overhead which will be at least an unlock operation in one cpu followed
> by a lock operation in another cpu.
>
> The measurement is done in check_tsc_sync_target() after stop_count
> reached 2 which is set by the source cpu after it re-initializes the sync
> variables causing the lock cacheline to be remote from the target cpu.
> The subsequent time measurement will then be similar to latency between
> successive 2-cpu sync loop in check_tsc_warp().
>
> Interrupt should not yet been enabled when check_tsc_sync_target() is

Interrupts _ARE_ not enabled.

> called. However some interference may have caused the overhead estimation
> to vary a bit. With this patch applied, the measured overhead on the
> same CooperLake system on different reboot runs varies from 104 to 326.

Hmm.

> [ 0.008807] TSC ADJUST compensate: CPU36 observed 95626 warp. Adjust: 95626
> [ 0.008807] TSC ADJUST compensate: CPU36 observed 74 warp. Adjust: 95700
> [ 0.974281] Measured 4 cycles TSC warp between CPUs, turning off TSC clock.
> [ 0.974281] tsc: Marking TSC unstable due to check_tsc_sync_source failed

Can you please provide the output after your changes? It's hard to tell
what effect adding the lock compensation has. Ideally from several runs
for both patched and unpatched.

Also if the compensation value is at the upper end and the real overhead
is way lower then the validation run might end up with the opposite
result. I'm a bit worried about this variation.

Thanks,

tglx