Re: [PATCH v9 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path

From: Yicong Yang
Date: Tue Aug 01 2023 - 08:07:06 EST


Hi Chenyu,

Sorry for the late reply. Something went wrong and this mail didn't appear
in my mailbox, so I checked it out on LKML.

On 2023/7/21 17:52, Chen Yu wrote:
> Hi Yicong,
>
> Thanks for sending this version!
>
> On 2023-07-19 at 17:28:38 +0800, Yicong Yang wrote:
>> From: Barry Song <song.bao.hua@xxxxxxxxxxxxx>
>>
>> For platforms having clusters like Kunpeng920, CPUs within the same cluster
>> have lower latency when synchronizing and accessing shared resources like
>> cache. Thus, this patch tries to find an idle CPU within the cluster of the
>> target CPU before scanning the whole LLC, to gain lower latency. This
>> is implemented in 3 steps in select_idle_sibling():
>> 1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them
>> if they share a cluster with the target CPU. Otherwise record them
>> and do the scanning first.
>> 2. Scan the cluster prior to the LLC of the target CPU for an
>> idle CPU to wake up.
>> 3. If no idle CPU is found after scanning and the prev_cpu/recent_used_cpu
>> can be used, use them.
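
To make the 3 steps easier to follow: condensed into not-quite-literal C
(cpus_share_resources() is what patch 1/2 adds), the resulting flow in
select_idle_sibling() is roughly:

	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
		/* Step 1: prev also shares the cluster with target, use it */
		if (!static_branch_unlikely(&sched_cluster_active) ||
		    cpus_share_resources(prev, target))
			return prev;

		/* Otherwise remember prev as a fallback for step 3 */
		prev_aff = prev;
	}

	/* Step 2: scan the cluster of target first, then the rest of the LLC */
	i = select_idle_cpu(p, sd, has_idle_core, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;

	/* Step 3: nothing idle in the whole LLC, use the recorded fallback */
	if (prev_aff != -1)
		return prev_aff;

The same recording is done for recent_used_cpu.
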
>>
>> Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and
>> to two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and each
>> cluster has 4 CPUs.
>>
>> With this patch, we noticed improvements on tbench and netperf within one
>> NUMA node and across two NUMA nodes on 6.5-rc1:
>> tbench results (node 0):
>> baseline patched
>> 1: 325.9673 378.9117 ( 16.24%)
>> 4: 1311.9667 1501.5033 ( 14.45%)
>> 8: 2629.4667 2961.9100 ( 12.64%)
>> 16: 5259.1633 5928.0833 ( 12.72%)
>> 32: 10368.6333 10566.8667 ( 1.91%)
>> 64: 7868.7700 8182.0100 ( 3.98%)
>> 128: 6528.5733 6801.8000 ( 4.19%)
>> tbench results (node 0-1):
>> baseline patched
>> 1: 329.2757 380.8907 ( 15.68%)
>> 4: 1327.7900 1494.5300 ( 12.56%)
>> 8: 2627.2133 2917.1233 ( 11.03%)
>> 16: 5201.3367 5835.9233 ( 12.20%)
>> 32: 8811.8500 11154.2000 ( 26.58%)
>> 64: 15832.4000 19643.7667 ( 24.07%)
>> 128: 12605.5667 14639.5667 ( 16.14%)
>> netperf results TCP_RR (node 0):
>> baseline patched
>> 1: 77302.8667 92172.2100 ( 19.24%)
>> 4: 78724.9200 91581.3100 ( 16.33%)
>> 8: 79168.1296 91091.7942 ( 15.06%)
>> 16: 81079.4200 90546.5225 ( 11.68%)
>> 32: 82201.5799 78910.4982 ( -4.00%)
>> 64: 29539.3509 29131.4698 ( -1.38%)
>> 128: 12082.7522 11956.7705 ( -1.04%)
>> netperf results TCP_RR (node 0-1):
>> baseline patched
>> 1: 78340.5233 92101.8733 ( 17.57%)
>> 4: 79644.2483 91326.7517 ( 14.67%)
>> 8: 79557.4313 90737.8096 ( 14.05%)
>> 16: 79215.5304 90568.4542 ( 14.33%)
>> 32: 78999.3983 85460.6044 ( 8.18%)
>> 64: 74198.9494 74325.4361 ( 0.17%)
>> 128: 27397.4810 27757.5471 ( 1.31%)
>> netperf results UDP_RR (node 0):
>> baseline patched
>> 1: 95721.9367 111546.1367 ( 16.53%)
>> 4: 96384.2250 110036.1408 ( 14.16%)
>> 8: 97460.6546 109968.0883 ( 12.83%)
>> 16: 98876.1687 109387.8065 ( 10.63%)
>> 32: 104364.6417 105241.6767 ( 0.84%)
>> 64: 37502.6246 37451.1204 ( -0.14%)
>> 128: 14496.1780 14610.5538 ( 0.79%)
>> netperf results UDP_RR (node 0-1):
>> baseline patched
>> 1: 96176.1633 111397.5333 ( 15.83%)
>> 4: 94758.5575 105681.7833 ( 11.53%)
>> 8: 94340.2200 104138.3613 ( 10.39%)
>> 16: 95208.5285 106714.0396 ( 12.08%)
>> 32: 74745.9028 100713.8764 ( 34.74%)
>> 64: 59351.4977 73536.1434 ( 23.90%)
>> 128: 23755.4971 26648.7413 ( 12.18%)
>>
>> Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT
>> branch in the code has not been tested, but it is supposed to work.
>>
>> Chen Yu also noticed that this improves the performance of tbench and
>> netperf on a 24-CPU Jacobsville machine, where 4 CPUs in one cluster
>> share the L2 cache.
>>
>> Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx]
>> Tested-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
>> Signed-off-by: Barry Song <song.bao.hua@xxxxxxxxxxxxx>
>> Signed-off-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
>> Reviewed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>> Reviewed-by: Chen Yu <yu.c.chen@xxxxxxxxx>
>> ---
>> kernel/sched/fair.c | 59 +++++++++++++++++++++++++++++++++++++----
>> kernel/sched/sched.h | 1 +
>> kernel/sched/topology.c | 12 +++++++++
>> 3 files changed, 67 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b3e25be58e2b..d91bf64f81f5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7012,6 +7012,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> }
>> }
>>
>> + if (static_branch_unlikely(&sched_cluster_active)) {
>> + struct sched_group *sg = sd->groups;
>> +
>> + if (sg->flags & SD_CLUSTER) {
>> + for_each_cpu_wrap(cpu, sched_group_span(sg), target + 1) {
>> + if (!cpumask_test_cpu(cpu, cpus))
>> + continue;
>> +
>> + if (has_idle_core) {
>> + i = select_idle_core(p, cpu, cpus, &idle_cpu);
>> + if ((unsigned int)i < nr_cpumask_bits)
>> + return i;
>> + } else {
>> + if (--nr <= 0)
>> + return -1;
>> + idle_cpu = __select_idle_cpu(cpu, p);
>> + if ((unsigned int)idle_cpu < nr_cpumask_bits)
>> + return idle_cpu;
>> + }
>> + }
>> + cpumask_andnot(cpus, cpus, sched_group_span(sg));
>> + }
>> + }
>> +
>> for_each_cpu_wrap(cpu, cpus, target + 1) {
>> if (has_idle_core) {
>> i = select_idle_core(p, cpu, cpus, &idle_cpu);
>> @@ -7019,7 +7043,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> return i;
>>
>> } else {
>> - if (!--nr)
>> + if (--nr <= 0)
>> return -1;
>> idle_cpu = __select_idle_cpu(cpu, p);
>> if ((unsigned int)idle_cpu < nr_cpumask_bits)
>> @@ -7121,7 +7145,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> bool has_idle_core = false;
>> struct sched_domain *sd;
>> unsigned long task_util, util_min, util_max;
>> - int i, recent_used_cpu;
>> + int i, recent_used_cpu, prev_aff = -1;
>>
>> /*
>> * On asymmetric system, update task utilization because we will check
>> @@ -7148,8 +7172,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>> */
>> if (prev != target && cpus_share_cache(prev, target) &&
>> (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
>> - asym_fits_cpu(task_util, util_min, util_max, prev))
>> - return prev;
>> + asym_fits_cpu(task_util, util_min, util_max, prev)) {
>> + if (!static_branch_unlikely(&sched_cluster_active))
>> + return prev;
>> +
>> + if (cpus_share_resources(prev, target))
>> + return prev;
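
(The quoted diff is trimmed here. For completeness, the sched.h/topology.c
part of the patch only sets up the sched_cluster_active static key tested
above; roughly, and not the verbatim hunk:

	DEFINE_STATIC_KEY_FALSE(sched_cluster_active);

	/* in build_sched_domains(), while walking each CPU's domains */
	if (sd->flags & SD_CLUSTER)
		has_cluster = true;
	...
	if (has_cluster)
		static_branch_inc_cpuslocked(&sched_cluster_active);

so all the new checks in the wakeup path compile to no-ops on machines
without clusters.)
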
>
> I have one minor question: previously Peter mentioned that he wants to get
> rid of the percpu sd_share_id, but I'm not sure whether he means not using
> it in select_idle_cpu() or removing that variable completely so as not to
> introduce extra space.
> Hi Peter, could you please give us more hints on this? Thanks.
>
> If we want to get rid of this variable, would this work?
>
> if ((sd->groups->flags & SD_CLUSTER) &&
>     cpumask_test_cpu(prev, sched_group_span(sd->groups)))
>         return prev;
>

In the current implementation, no: at that point we haven't dereferenced @sd
yet, and we don't need to if no scanning is required.

Since we're on the quick path without scanning here, I wonder if it'd be a bit
more efficient to compare a per-cpu id rather than dereference the RCU-protected
domain pointer and do the cpumask computation.
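
Concretely, the per-cpu id comparison from patch 1/2 boils down to an
integer compare on the quick path, roughly:

	bool cpus_share_resources(int this_cpu, int that_cpu)
	{
		if (this_cpu == that_cpu)
			return true;

		/* sd_share_id is assigned per CPU at domain build time */
		return per_cpu(sd_share_id, this_cpu) ==
		       per_cpu(sd_share_id, that_cpu);
	}

while the sd->groups based check would need the rcu_dereference() of the
LLC domain plus a cpumask test before we even know whether prev can be
returned directly.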

Thanks.