Re: [PATCH v6] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

From: K Prateek Nayak
Date: Tue Mar 08 2022 - 06:48:40 EST


Hello Srikar,

On 3/8/2022 2:59 PM, Srikar Dronamraju wrote:
> [..snip..]
> If I recollect correctly, each stream thread, has its independent data set.
> However if the threads were all to contend for the same resource (memory) or
> a waker/wakee relationships, would we not end up spreading the waker/wakee
> apart?

We only spread tasks based on the number of allowed CPUs in the domain.
For Stream with 8 threads, where it is beneficial for the threads to
spread across LLCs, the user might pin the tasks as follows:

numactl -C 0,16,32,48,64,80,96,112 ./stream8

For communicating processes, binding as shown above would
indeed be bad for performance. Instead, the user may prefer to
pin the process to a different subset of CPUs, such as:

numactl -C 0-7,128-135 ./communicating # Bind to single LLC

This will ensure the communicating threads are on the same LLC.
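
The patch does not force any spreading either; it only stops the
new-task placement from assuming the task can run on every online
CPU. Roughly, the idea is the following (a simplified sketch of the
clamping, not the exact hunk quoted below; the scratch-mask handling
in the actual patch may differ):

    /*
     * Sketch: when the task is pinned to a subset of CPUs, limit
     * the allowed NUMA imbalance to the number of allowed CPUs
     * that actually lie in the local group, instead of using
     * sd->imb_numa_nr as is. 'cpus' is a scratch cpumask.
     */
    imb = sd->imb_numa_nr;
    if (p->nr_cpus_allowed != num_online_cpus()) {
            cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
            imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
    }

    if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
            return NULL;    /* keep the new task local */

With the stream8 mask above, only four of the allowed CPUs lie in the
local node on this 2 x 64C/128T machine, so imb drops from 8 to 4 and
the remaining threads spill over to the other node instead of stacking
on the four local CPUs. A process bound to a single LLC as in the
second example keeps all of its allowed CPUs in the local node, so the
clamp changes nothing and the communicating threads are not pulled
apart.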

>> [..snip..]
>>
>> Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>>
>>              5.17.0-rc1             5.17.0-rc1              5.17.0-rc1
>>            tip sched/core         tip sched/core          tip sched/core
>>             (no pinning)            + pinning              + this-patch
>>                                                             + pinning
>>
>>  Copy:   97699.28 (0.00 pct)    95933.60 (-1.80 pct)    156578.91 (60.26 pct)
>> Scale:  107754.15 (0.00 pct)    91869.88 (-14.74 pct)   149783.25 (39.00 pct)
>>   Add:  126383.29 (0.00 pct)   105730.86 (-16.34 pct)   186493.09 (47.56 pct)
>> Triad:  124896.78 (0.00 pct)   106394.38 (-14.81 pct)   184733.48 (47.90 pct)
>>
> Do we have numbers for the with-your patch - non-pinned case?
Yes, we do.

             5.17.0-rc1               5.17.0-rc1
           tip sched/core           tip sched/core
            (no pinning)             + this-patch
                                     (no pinning)

 Copy:   97699.28 (0.00 pct)     106114.59 (8.61 pct)
Scale:  107754.15 (0.00 pct)     106814.48 (-0.87 pct)
  Add:  126383.29 (0.00 pct)     125323.78 (-0.83 pct)
Triad:  124896.78 (0.00 pct)     123622.62 (-1.02 pct)

These are the results on the same dual-socket 2 x 64C/128T machine.

Following is the output of the sched_wakeup_new tracepoint for Stream
with 8 threads without any pinning:

- 5.17-rc1 tip/sched/core (no pinning)

stream-4570    [025] d..2.   115.786096: sched_wakeup_new: comm=stream pid=4572 prio=120 target_cpu=048  (LLC: 6)
stream-4570    [025] d..2.   115.786160: sched_wakeup_new: comm=stream pid=4573 prio=120 target_cpu=175  (LLC: 5)
stream-4570    [025] d..2.   115.786221: sched_wakeup_new: comm=stream pid=4574 prio=120 target_cpu=000  (LLC: 0)
stream-4570    [025] d..2.   115.786271: sched_wakeup_new: comm=stream pid=4575 prio=120 target_cpu=145  (LLC: 2)
stream-4570    [025] d..2.   115.786329: sched_wakeup_new: comm=stream pid=4576 prio=120 target_cpu=138  (LLC: 1)
stream-4570    [025] d..2.   115.786375: sched_wakeup_new: comm=stream pid=4577 prio=120 target_cpu=059  (LLC: 7)
stream-4570    [025] d..2.   115.786415: sched_wakeup_new: comm=stream pid=4578 prio=120 target_cpu=036  (LLC: 4)

- 5.17-rc1 tip/sched/core + this-patch (no pinning)

stream-4537    [191] d..2.   115.926113: sched_wakeup_new: comm=stream pid=4539 prio=120 target_cpu=162  (LLC: 4)
stream-4537    [191] d..2.   115.926181: sched_wakeup_new: comm=stream pid=4540 prio=120 target_cpu=000  (LLC: 0)
stream-4537    [191] d..2.   115.926235: sched_wakeup_new: comm=stream pid=4541 prio=120 target_cpu=144  (LLC: 2)
stream-4537    [191] d..2.   115.926284: sched_wakeup_new: comm=stream pid=4542 prio=120 target_cpu=025  (LLC: 3)
stream-4537    [191] d..2.   115.926332: sched_wakeup_new: comm=stream pid=4543 prio=120 target_cpu=048  (LLC: 6)
stream-4537    [191] d..2.   115.926376: sched_wakeup_new: comm=stream pid=4544 prio=120 target_cpu=137  (LLC: 1)
stream-4537    [191] d..2.   115.926436: sched_wakeup_new: comm=stream pid=4545 prio=120 target_cpu=041  (LLC: 5)

All the threads remain in the same socket, as imb_numa_nr in NPS1 mode
is 8, but each thread gets a separate LLC.
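
In the unpinned case, nr_cpus_allowed equals num_online_cpus(), so the
new clamp is skipped entirely and placement is governed by the
pre-existing check alone (roughly, as a reminder; this is the check
visible in the hunk quoted further down, not something this patch
changes):

    /*
     * While the local node's running count (including the newly
     * forked thread) stays within sd->imb_numa_nr (8 in NPS1),
     * the new thread is kept on the wakeup node; the idlest-group
     * search then continues within that node and lands each
     * thread on an idle LLC, which is why the trace above shows
     * the 8 threads covering 8 different LLCs of one socket.
     */
    if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
            return NULL;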
> I would assume they would be the same as 1st column since your change
> affects only pinned case. But I am wondering if this problem happens in the
> unpinned case or not?

We see nearly identical results in the unpinned case, both with and
without this patch.

> Also Stream on powerpc seems to have some variation in results, did we take
> a mean of runs, or is it just results of just one run?
The results posted are the mean of 10 runs.
>> Pinning currently hurts the performance compared to unbound case on
>> tip/sched/core. With the addition of this patch, we are able to
>> outperform tip/sched/core by a good margin with pinning.
>>
>> Following are the results from running 16 Stream threads with and
>> without pinning on a dual socket Skylake Machine (2 x 24C/48T):
>>
>> Pinning is done using: numactl -C 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 ./stream16
>>
>>              5.17.0-rc1             5.17.0-rc1              5.17.0-rc1
>>            tip sched/core         tip sched/core          tip sched/core
>>             (no pinning)            + pinning              + this-patch
>>                                                             + pinning
>>
>>  Copy:  126620.67 (0.00 pct)   141062.10 (11.40 pct)   147615.44 (16.58 pct)
>> Scale:   91313.51 (0.00 pct)   112879.61 (23.61 pct)   122591.28 (34.25 pct)
>>   Add:  102035.43 (0.00 pct)   125889.98 (23.37 pct)   138179.01 (35.42 pct)
>> Triad:  102281.91 (0.00 pct)   123743.48 (20.98 pct)   138940.41 (35.84 pct)
>>
>> In case of Skylake machine, with single LLC per socket, we see good
>> improvement brought about by pinning which is further benefited by
>> this patch.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
>> Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
>> ---
>> Changelog v5-->v6:
>> - Move the cpumask variable declaration to the block it is
>> used in.
>> - Collect tags from v5.
>> Changelog v4-->v5:
>> - Only perform cpumask operations if nr_cpus_allowed is not
>> equal to num_online_cpus based on Mel's suggestion.
>> ---
>> kernel/sched/fair.c | 16 +++++++++++++---
>> 1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 16874e112fe6..6cc90d76250f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9183,6 +9183,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>
>> case group_has_spare:
>> if (sd->flags & SD_NUMA) {
>> + int imb;
>> #ifdef CONFIG_NUMA_BALANCING
>> int idlest_cpu;
>> /*
>> @@ -9200,10 +9201,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>> * Otherwise, keep the task close to the wakeup source
>> * and improve locality if the number of running tasks
>> * would remain below threshold where an imbalance is
>> - * allowed. If there is a real need of migration,
>> - * periodic load balance will take care of it.
>> + * allowed while accounting for the possibility the
>> + * task is pinned to a subset of CPUs. If there is a
>> + * real need of migration, periodic load balance will
>> + * take care of it.
>> */
>> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
>> + imb = sd->imb_numa_nr;
>> + if (p->nr_cpus_allowed != num_online_cpus()) {
> Again, repeating, is the problem only happening in the pinned case?
Yes. We've tested Stream with 8 and 16 threads on a Zen3 system with
16 LLCs, and in both cases, with unbound runs, each Stream thread got
a separate LLC and we did not observe any stacking.
--
Thanks and Regards,
Prateek