On Sun, Apr 02, 2023 at 09:04:04PM -0400, Waiman Long wrote:
On 3/31/23 13:16, Paul E. McKenney wrote:Thus far I am seeing eight systems, but out of a large number. So this
On Tue, Mar 28, 2023 at 02:58:54PM -0700, Paul E. McKenney wrote:The purpose of the first patch is just to avoid false cacheline sharing
On Mon, Mar 27, 2023 at 10:19:54AM +0800, Feng Tang wrote:[ . . . ]
On Fri, Mar 24, 2023 at 05:47:33PM -0700, Paul E. McKenney wrote:
On Wed, Mar 22, 2023 at 01:14:48PM +0800, Feng Tang wrote:
Waiman, do you recall what fraction of the benefit was provided by theAnd that is exactly what happened. ;-)Second, we are very occasionally running into console messages like this:I haven't seen this error myself or got similar reports. Usually it
Measured 2 cycles TSC warp between CPUs, turning off TSC clock.
This comes from check_tsc_sync_source() and indicates that one CPU's
TSC read produced a later time than a later read from some other CPU.
I am beginning to suspect that these can be caused by unscheduled delays
in the TSC synchronization code, but figured I should ask you if you have
ever seen these. And of course, if so, what the usual causes might be.
should be easy to detect once happened, as falling back to HPET
will trigger obvious performance degradation.
Could you give more detail about when and how it happens, and theWe are in early days, so I am checking for other experiences.
HW info like how many sockets the platform has.
CC Thomas, Waiman, as they discussed simliar case here:Fun! ;-)
https://lore.kernel.org/lkml/87h76ew3sb.ffs@tglx/T/#md4d0a88fb708391654e78312ffa75b481690699f
first patch, that is, the one that grouped the sync_lock, last_tsc,
max_warp, nr_warps, and random_warps global variables into a single
struct?
between the watchdog cpu and another cpu that happens to access a nearby
data in the same cacheline.
Now I realize that I should have followed up with this patch series. The
problem reported in that patch series happen on one system only, I believe.
is very much preliminary.
Almost all of them are single-socket systems.And what we are seeing is unlikely to be due to cache-latency-inducedMaybe the difference in wrap is due to NUMA distance of the running cpu from
delays. We see a very precise warp, for example, one system always
has 182 cycles of TSC warp, another 273 cycles, and a third 469 cycles.
Another is at the insanely large value of about 2^64/10, and shows some
variation, but that variation is only about 0.1%.
But any given system only sees warp on about half of its reboots.
Perhaps due to the automation sometimes power cycling?
There are few enough affected systems that investigation will take
some time.
the node where the data reside. It will be interesting to see if my patch
helps.
If the problem sticks with a few systems, I should be able to test
patches no problem. If it is randomly distributed across the fleet, a
bit more prework analysis will be called for. But what is life without
a challenge? ;-)