Re: [PATCH v4 1/2] sched: Add per_cpu cluster domain info and cpus_share_resources API

From: Yicong Yang
Date: Thu Jun 16 2022 - 03:55:27 EST


Hi Prateek,

Seems my previous reply is in the wrong format so the server rejected it..just repost..

Thanks a lot for the test!

On 2022/6/15 22:19, K Prateek Nayak wrote:
> Hello Yicong,
>
> I replied to the v3 of the series by mistake!
> (https://lore.kernel.org/lkml/0b065646-5b05-cbdb-b20c-e1dfef3f4d79@xxxxxxx/)
> But rest assured all the analysis discussed there was done with
> the v4 patch series. I'll add the same observations below so we
> can continue discussion on v4.
>
> We are observing some serious regression with tbench with this patch
> series applied. The issue doesn't seem to be related to the actual
> functionality of the patch but how the patch changes the per CPU
> variable layout.
>
> Discussed below are the results from running tbench on a dual
> socket Zen3 (2 x 64C/128T) configured in different NPS modes.
>
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 socket.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 223-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 socket.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 223-231
> Node 7: 112-127, 232-255
>
> Benchmark Results:
>
> Kernel versions:
> - tip: 5.19.0-rc2 tip sched/core
> - cluster: 5.19.0-rc2 tip sched/core + both the patches of the series
>
> When we started testing, the tip was at:
> commit: f3dd3f674555 "sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle"
>
> * - Data points of concern
>
> ~~~~~~
> tbench
> ~~~~~~
>
> NPS1
>
> Clients: tip cluster
> 1 444.41 (0.00 pct) 439.27 (-1.15 pct)
> 2 879.23 (0.00 pct) 831.49 (-5.42 pct) *
> 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct)
> 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) *
> 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) *
> 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) *
> 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) *
> 128 30795.27 (0.00 pct) 30861.34 (0.21 pct)
> 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct)
> 512 51287.93 (0.00 pct) 51855.55 (1.10 pct)
> 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct)
>
> NPS2
>
> Clients: tip cluster
> 1 445.45 (0.00 pct) 441.75 (-0.83 pct)
> 2 869.24 (0.00 pct) 845.61 (-2.71 pct)
> 4 1644.28 (0.00 pct) 1586.49 (-3.51 pct)
> 8 3120.83 (0.00 pct) 2967.01 (-4.92 pct) *
> 16 5972.29 (0.00 pct) 5208.58 (-12.78 pct) *
> 32 11776.38 (0.00 pct) 10229.53 (-13.13 pct) *
> 64 20933.15 (0.00 pct) 17033.45 (-18.62 pct) *
> 128 32195.00 (0.00 pct) 29507.85 (-8.34 pct) *
> 256 24641.52 (0.00 pct) 27225.00 (10.48 pct)
> 512 50806.96 (0.00 pct) 51377.50 (1.12 pct)
> 1024 51993.96 (0.00 pct) 50773.35 (-2.34 pct)
>
> NPS4
>
> Clients: tip cluster
> 1 442.10 (0.00 pct) 435.06 (-1.59 pct)
> 2 870.94 (0.00 pct) 858.64 (-1.41 pct)
> 4 1615.30 (0.00 pct) 1607.27 (-0.49 pct)
> 8 3195.95 (0.00 pct) 3020.63 (-5.48 pct) *
> 16 5937.41 (0.00 pct) 5719.87 (-3.66 pct)
> 32 11800.41 (0.00 pct) 11229.65 (-4.83 pct) *
> 64 20844.71 (0.00 pct) 20432.79 (-1.97 pct)
> 128 31003.62 (0.00 pct) 29441.20 (-5.03 pct) *
> 256 27476.37 (0.00 pct) 25857.30 (-5.89 pct) * [Know to have run to run variance]
> 512 52276.72 (0.00 pct) 51659.16 (-1.18 pct)
> 1024 51372.10 (0.00 pct) 51026.87 (-0.67 pct)
>
> Note: tbench results for 256 workers are known to have
> run to run variation on the test machine. Any regression
> seen for the data point can be safely ignored.
>
> The behavior is consistent for both tip and patched kernel
> across multiple runs of tbench.
>
> ~~~~~~~~~~~~~~~~~~~~
> Analysis done so far
> ~~~~~~~~~~~~~~~~~~~~
>
> To root cause this issue quicker, we have focused on 8 to 64 clients
> data points with the machine running in NPS1 mode.
>
> - Even on disabling HW prefetcher, the behavior remains consistent
> signifying HW prefetcher is not the cause of the problem.
>
> - Bisecting:
>
> When we ran the tests with only Patch 1 of the series, the
> regression was visible and the numbers were worse.
>
> Clients: tip cluster Patch 1 Only
> 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3018.63 (-7.51 pct)
> 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 4869.26 (-18.99 pct)
> 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 8159.60 (-32.33 pct)
> 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 13161.92 (-38.08 pct)
>
> We further bisected the hunks to narrow down the cause to the per CPU
> variable declarations.
>
>
> On 6/9/2022 5:36 PM, Yicong Yang wrote:
>>
>> [..snip..]
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 01259611beb9..b9bcfcf8d14d 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1753,7 +1753,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>> DECLARE_PER_CPU(int, sd_llc_size);
>> DECLARE_PER_CPU(int, sd_llc_id);
>> +DECLARE_PER_CPU(int, sd_share_id);
>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>
> The main reason for the regression seems to be the above declarations.
> The regression seem to go away if we do one of the following:
>
> - Declare sd_share_id and sd_cluster using DECLARE_PER_CPU_READ_MOSTLY()
> instead of DECLARE_PER_CPU() and change the corresponding definition
> below to DEFINE_PER_CPU_READ_MOSTLY().
>
> Clients: tip Patch 1 Patch 1 (READ_MOSTLY)
> 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3237.33 (-0.56 pct)
> 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5914.53 (-2.92 pct)
> 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11536.05 (3.40 pct)
> 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21162.33 (0.67 pct)
>
> - Convert sd_share_id and sd_cluster to static arrays.
>
> Clients: tip Patch 1 Patch 1 (Static Array)
> 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3203.27 (-1.61 pct)
> 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 6198.35 (1.73 pct)
> 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11385.76 (2.05 pct)
> 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 21919.80 (4.28 pct)
>
> - Move the declarations of sd_share_id and sd_cluster to the top
>
> Clients: tip Patch 1 Patch 1 (Declarion on Top)
> 8 3255.69 (0.00 pct) 3018.63 (-7.28 pct) 3072.30 (-5.63 pct)
> 16 6092.67 (0.00 pct) 4869.26 (-20.08 pct) 5586.59 (-8.30 pct)
> 32 11156.56 (0.00 pct) 8159.60 (-26.86 pct) 11184.17 (0.24 pct)
> 64 21019.97 (0.00 pct) 13161.92 (-37.38 pct) 20289.70 (-3.47 pct)
>
> Unfortunately, none of these are complete solutions. For example, using
> DECLARE_PER_CPU_READ_MOSTLY() with both patches applied reduces the regression
> but doesn't eliminate it entirely:
>
> Clients: tip cluster cluster (READ_MOSTLY)
> 1 444.41 (0.00 pct) 439.27 (-1.15 pct) 435.95 (-1.90 pct)
> 2 879.23 (0.00 pct) 831.49 (-5.42 pct) 842.09 (-4.22 pct)
> 4 1648.83 (0.00 pct) 1608.07 (-2.47 pct) 1598.77 (-3.03 pct)
> 8 3263.81 (0.00 pct) 3086.81 (-5.42 pct) 3090.86 (-5.29 pct) *
> 16 6011.19 (0.00 pct) 5360.28 (-10.82 pct) 5360.28 (-10.82 pct) *
> 32 12058.31 (0.00 pct) 8769.08 (-27.27 pct) 11083.66 (-8.08 pct) *
> 64 21258.21 (0.00 pct) 19021.09 (-10.52 pct) 20984.30 (-1.28 pct)
> 128 30795.27 (0.00 pct) 30861.34 (0.21 pct) 30735.20 (-0.19 pct)
> 256 25138.43 (0.00 pct) 24711.90 (-1.69 pct) 24021.21 (-4.44 pct)
> 512 51287.93 (0.00 pct) 51855.55 (1.10 pct) 51672.73 (0.75 pct)
> 1024 53176.97 (0.00 pct) 52554.55 (-1.17 pct) 52620.02 (-1.04 pct)
>
> We are still trying to root cause the underlying issue that
> brought about such drastic regression in tbench performance.
>
>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 05b6c2ad90b9..0595827d481d 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>> DEFINE_PER_CPU(int, sd_llc_size);
>> DEFINE_PER_CPU(int, sd_llc_id);
>> +DEFINE_PER_CPU(int, sd_share_id);
>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>
>> [..snip..]
>>
>
> We would like some time to investigate this issue and root cause
> the reason for this regression.

One significant difference I can see for now is that Kunpeng 920 is a non-SMT machine while Zen3
is a SMT machine. So in the select_idle_sibling() path we won't use sd_llc_shared. Since sd_share_id
and sd_cluster are inserted between sd_llc_id and sd_llc_shared which are both used in the path on
your test, I guess the change of the layout may affect something like the cache behaviour.

Can you also test with SMT disabled? Or change the layout a bit to see if there's any difference,
like:

DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(int, sd_share_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);

I need to further look into it and have some tests on a SMT machine. Would you mind to share
the kernel config as well? I'd like to compare the config as well.

Thanks,
Yicong