Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment

From: Waiman Long
Date: Tue Apr 26 2022 - 11:36:18 EST

Next message: Rahul T R: "[PATCH v4 2/2] arm64: dts: ti: k3-j721e-common-proc-board: add DP to j7 evm"
Previous message: Kefeng Wang: "Re: [PATCH] arm64: kcsan: Fix kcsan test_barrier fail and panic"
In reply to: Thomas Gleixner: "Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment"
Next in thread: Thomas Gleixner: "Re: [PATCH 2/2] x86/tsc_sync: Add synchronization overhead to tsc adjustment"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 4/25/22 15:24, Thomas Gleixner wrote:

On Mon, Apr 25 2022 at 09:20, Waiman Long wrote:

On 4/22/22 06:41, Thomas Gleixner wrote:

I did some experiments and noticed that the boot time overhead is
different from the overhead when doing the sync check after boot
(offline a socket and on/offline the first CPU of it several times).

During boot the overhead is lower on this machine (SKL-X), during
runtime it's way higher and more noisy.

The noise can be pretty much eliminated by running the sync_overhead
measurement multiple times and building the average.
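
A minimal user-space sketch of the averaging idea, not the kernel patch itself. The callback type `measure_fn`, the helper `average_overhead()` and the `replay` stand-in are hypothetical names introduced only for illustration. A real measurement would time the lock/unlock sequence with rdtsc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: performs one overhead measurement, returns TSC cycles. */
typedef uint64_t (*measure_fn)(void *ctx);

/* Repeat the measurement 'runs' times and return the integer mean. */
uint64_t average_overhead(measure_fn measure, void *ctx, unsigned int runs)
{
	uint64_t sum = 0;

	assert(measure != NULL && runs > 0);
	for (unsigned int i = 0; i < runs; i++)
		sum += measure(ctx);
	return sum / runs;
}

/* Stand-in measurement that replays canned samples, for illustration only. */
struct replay {
	const uint64_t *samples;
	size_t count;
	size_t next;
};

uint64_t replay_measure(void *ctx)
{
	struct replay *r = ctx;

	assert(r->next < r->count);
	return r->samples[r->next++];
}
```

Feeding it the made-up samples 140, 160, 150 and 170 yields a mean of 155. A single outlier moves the result much less than it would move a one-shot measurement.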

The reason why it is higher is that after offlining the socket the CPU
comes back up with a frequency of 700MHz while during boot it runs with
2100MHz.

Sync overhead: 118
Sync overhead: 51 A: 22466 M: 22448 F: 2101683

One explanation of the sync overhead difference (118 vs 51) here is
whether the lock cacheline is local or remote. My analysis of the
interaction between check_tsc_sync_source() and check_tsc_sync_target()
is that the real overhead comes from locking with a remote cacheline
(local to source, remote to target). When you run a loop of 256 lock
operations, the cacheline is always local. That is why the overhead is
lower. It also depends on whether the remote cacheline is in the same
socket or a different socket.

Yes. It's clear that the initial sync overhead is due to the cache line
being remote, but I'd rather underestimate the compensation. Aside from that
it's not guaranteed that the cache line is actually remote on the first
access. It's by chance, but not by design.

In check_tsc_warp(), the unlikely(prev > now) check may only be
triggered to record a possible warp if last_tsc was previously written
by another CPU. That requires transferring the lock cacheline from the
remote CPU to the local CPU as well. So the sync overhead with a remote
cacheline is what really matters here. I had actually thought about
measuring just the local-cacheline sync overhead so as to underestimate
it, and I am fine with doing that.
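
To make the point about last_tsc concrete, here is a simplified single-threaded model of the warp check. It is a sketch only: the struct and function names are hypothetical, the spinlock that protects the shared state in the kernel is omitted, and the readings are passed in rather than read with rdtsc.

```c
#include <assert.h>
#include <stdint.h>

/* Shared state, loosely modelled on the variables used by the warp check. */
struct warp_state {
	uint64_t last_tsc;	/* most recent TSC value stored by any CPU */
	uint64_t max_warp;	/* largest backwards step seen so far */
	unsigned int nr_warps;	/* number of backwards steps seen */
};

/*
 * Record one TSC reading 'now'. Returns 1 if it lies behind the value
 * stored by the previous caller, i.e. if a warp was observed.
 */
int observe_tsc(struct warp_state *s, uint64_t now)
{
	uint64_t prev;

	assert(s);
	prev = s->last_tsc;
	s->last_tsc = now;

	if (prev > now) {
		if (prev - now > s->max_warp)
			s->max_warp = prev - now;
		s->nr_warps++;
		return 1;
	}
	return 0;
}
```

With readings 1000 and 1010 from one CPU followed by 950 from a second CPU, only the third call reports a warp (of 60). Consecutive readings from the same CPU are monotonic, so a warp can only show up when the previous writer of last_tsc was the other CPU, which is also the case where the cacheline has to move.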

Sync overhead: 178
Sync overhead: 152 A: 22477 M: 67380 F: 700529

Sync overhead: 212
Sync overhead: 152 A: 22475 M: 67380 F: 700467

Sync overhead: 153
Sync overhead: 152 A: 22497 M: 67452 F: 700404

Can you try the patch below and check whether the overhead stabilizes
across several attempts on that Cooper Lake machine and whether the
frequency is always the same or varies?

Yes, I will try that experiment and report back the results.

Independent of the outcome on that, I think we have to take the actual CPU
frequency into account for calculating the overhead.

Assuming that the clock frequency remains the same during the
check_tsc_warp() loop and the sync overhead computation time, I don't
think the actual clock frequency matters much. However, it will be a
different matter if the frequency does change. In this case, it is more
likely the frequency will go up than down. Right? IOW, we may
underestimate the sync overhead in this case. I think it is better than
overestimating it.

The question is not whether the clock frequency changes during the loop.
The point is:

    start = rdtsc();
    do_stuff();
    end = rdtsc();
    compensation = end - start;

do_stuff() executes a constant number of instructions which are executed
in a constant number of CPU clock cycles, let's say 100 for simplicity.
TSC runs with 2000MHz.

With a CPU frequency of 1000 MHz the real computation time is:

100/1000MHz = 100 nsec = 200 TSC cycles

while with a CPU frequency of 2000MHz it is obviously:

100/2000MHz = 50 nsec = 100 TSC cycles

IOW, TSC runs with a constant frequency independent of the actual CPU
frequency, ergo the CPU frequency dependent execution time has an
influence on the resulting compensation value, no?

On the machine I tested on, it's a factor of 3 between the minimal and
the maximal CPU frequency, which makes quite a difference, right?
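
The arithmetic above can be written down as a small sketch. The helper names are hypothetical, frequencies are taken in kHz, and the second helper assumes that the measured code costs the same number of CPU cycles at both frequencies, which need not hold for accesses to a remote cacheline.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a duration of 'cpu_cycles' core clock cycles into TSC ticks,
 * given the current core frequency and the constant TSC frequency.
 */
uint64_t cpu_cycles_to_tsc(uint64_t cpu_cycles, uint64_t cpu_khz,
			   uint64_t tsc_khz)
{
	assert(cpu_khz > 0);
	return cpu_cycles * tsc_khz / cpu_khz;
}

/*
 * Rescale an overhead measured in TSC ticks at core frequency 'from_khz'
 * to the value expected at core frequency 'to_khz'.
 */
uint64_t rescale_overhead(uint64_t tsc_ticks, uint64_t from_khz,
			  uint64_t to_khz)
{
	assert(to_khz > 0);
	return tsc_ticks * from_khz / to_khz;
}
```

cpu_cycles_to_tsc(100, 1000000, 2000000) gives 200 and cpu_cycles_to_tsc(100, 2000000, 2000000) gives 100, matching the two cases above. Assuming the F: values in the logs are CPU frequencies in kHz, rescale_overhead(51, 2101683, 700529) gives 153, which is close to the 152 reported for the 700MHz case.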

Yes, I understand that. The measurement of sync_overhead is for
estimating the delay (in TSC cycles) that the locking overhead
introduces. With a 1000MHz CPU frequency, the delay in TSC cycles will
be double that of a CPU running at 2000MHz, so more compensation is
needed in that case. That is why I said that as long as the clock
frequency doesn't change in the check_tsc_warp() loop and the
sync_overhead measurement part of the code, the actual CPU frequency
does not matter here.

How about we halve the measured sync_overhead and use that as the
compensation? That avoids over-estimation, but it probably increases the
chance that we need a second adjustment of the TSC warp.

Cheers,
Longman
