Re: [RFC PATCH 1/1] sched: ttwu_queue_cond: perform queued wakeups across different L2 caches

From: Mathieu Desnoyers
Date: Thu Aug 17 2023 - 12:14:18 EST


On 8/17/23 12:09, Mathieu Desnoyers wrote:
On 8/17/23 12:01, Vincent Guittot wrote:
On Thu, 17 Aug 2023 at 17:34, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:

Skipping queued wakeups for all logical CPUs sharing an LLC means that
on a 192-core system (AMD EPYC 9654 96-Core Processor, 2 sockets), groups
of 8 cores (16 hardware threads) end up grabbing the runqueue locks of
other runqueues within the same group for each wakeup, causing contention
on those runqueue locks.
[...]

-bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_cluster(int this_cpu, int that_cpu);   /* Share L2. */
+bool cpus_share_cache(int this_cpu, int that_cpu);     /* Share LLC. */

I think that Yicong is doing what you want with
cpus_share_lowest_cache() which points to cluster when available or
LLC otherwise
https://lore.kernel.org/lkml/20220720081150.22167-1-yangyicong@xxxxxxxxxxxxx/t/#m0ab9fa0fe0c3779b9bbadcfbc1b643dce7cb7618


AFAIU (please correct me if I'm wrong) my AMD EPYC machine has sockets consisting of 12 clusters, each cluster having its own L3 cache.

What I am trying to achieve here is really to implement "cpus_share_l2": I want this to match only when the cpus have a common L2 cache. L3 appears to be a group which either:

- is too large (16 hw threads), or
- has too high an access latency.

I'm not certain which of those reasons (or whether both) explains why
grouping by L2 is better here.
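
To make the L2 versus LLC distinction concrete, here is a minimal user-space sketch in C on a toy topology. It is only an illustration, not the kernel code or the patch itself: all model_* names are hypothetical, the contiguous CPU numbering is an assumption, and so is the idea that one L2 is shared by the 2 SMT threads of a core. Only the "16 hw threads per L3" figure comes from the text above. The two model_queue_wakeup_* helpers stand in for just one condition of the wakeup path (queue the wakeup when waker and wakee do not share the chosen cache level).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy topology; nothing here is read from real hardware. */
enum {
	MODEL_THREADS_PER_L2  = 2,	/* assumption: SMT siblings share one L2 */
	MODEL_THREADS_PER_LLC = 16,	/* 8 cores (16 hw threads) per L3 */
	MODEL_NR_CPUS         = 32,	/* two L3 groups are enough to illustrate */
};

/* Assumption: CPUs are numbered contiguously within each cache domain. */
static int model_l2_id(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return cpu / MODEL_THREADS_PER_L2;
}

static int model_llc_id(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return cpu / MODEL_THREADS_PER_LLC;
}

/* True only when both CPUs sit behind the same L2. */
bool model_cpus_share_l2(int this_cpu, int that_cpu)
{
	return model_l2_id(this_cpu) == model_l2_id(that_cpu);
}

/* True when both CPUs sit behind the same last-level cache (L3 here). */
bool model_cpus_share_llc(int this_cpu, int that_cpu)
{
	return model_llc_id(this_cpu) == model_llc_id(that_cpu);
}

/* LLC-based policy: queue the wakeup only across different LLCs. */
bool model_queue_wakeup_llc(int this_cpu, int that_cpu)
{
	return !model_cpus_share_llc(this_cpu, that_cpu);
}

/* L2-based policy: queue the wakeup as soon as the L2 differs. */
bool model_queue_wakeup_l2(int this_cpu, int that_cpu)
{
	return !model_cpus_share_l2(this_cpu, that_cpu);
}
```

In this model, CPUs 0 and 5 share an L3 but not an L2, so the LLC-based policy would take the remote runqueue lock directly while the L2-based policy would queue the wakeup instead.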

Re-reading the patch you pointed me to, I notice:

"+ * Whether CPUs are share lowest cache, which means LLC on non-cluster
+ * machines and LLC tag or L2 on machines with clusters."

So this "share lowest cache" really means lowest in terms of level number, e.g. L2 < L3, and not "lowest in the hierarchy" as in "closest to memory", correct?
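
If that reading is right, the semantics described by the quoted comment could be sketched as follows. This is a hedged user-space model, not Yicong's implementation: struct model_cpu, its fields, and the machine_has_clusters flag are hypothetical names introduced only for illustration, and the sketch assumes a machine either has clusters everywhere or nowhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU description; field names are illustrative only. */
struct model_cpu {
	int cluster_id;	/* L2 (or LLC tag) domain, meaningful on cluster machines */
	int llc_id;	/* last-level cache domain */
};

/*
 * "Lowest" is read here as the lowest-numbered cache level available:
 * the cluster level when the machine has clusters, the LLC otherwise.
 */
bool model_cpus_share_lowest_cache(const struct model_cpu *a,
				   const struct model_cpu *b,
				   bool machine_has_clusters)
{
	assert(a != NULL && b != NULL);

	if (machine_has_clusters)
		return a->cluster_id == b->cluster_id;

	return a->llc_id == b->llc_id;
}
```

Under this model, two CPUs in the same LLC but in different clusters only "share lowest cache" on a machine without clusters.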

Thanks,

Mathieu




--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com