Re: [PATCH v7] sched/fair: Consider cpu affinity when allowing NUMA imbalance in find_idlest_group

From: K Prateek Nayak
Date: Mon Jun 13 2022 - 11:02:17 EST


Hello Peter,

Thank you for reviewing and picking up this patch.
I've shared some observations w.r.t. p->nr_cpus_allowed,
num_online_cpus() and NR_CPUS based on the discussion in
the thread.

tl;dr

p->nr_cpus_allowed seems to be equal to num_online_cpus() at the
time of fork, as opposed to NR_CPUS, based on my testing.

On 6/9/2022 5:24 PM, Peter Zijlstra wrote:
> On Thu, Apr 07, 2022 at 04:42:22PM +0530, K Prateek Nayak wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d4bd299d67ab..520593bf0de6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9215,6 +9215,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>
>> case group_has_spare:
>> if (sd->flags & SD_NUMA) {
>> + int imb;
>> #ifdef CONFIG_NUMA_BALANCING
>> int idlest_cpu;
>> /*
>> @@ -9232,10 +9233,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>> * Otherwise, keep the task close to the wakeup source
>> * and improve locality if the number of running tasks
>> * would remain below threshold where an imbalance is
>> - * allowed. If there is a real need of migration,
>> - * periodic load balance will take care of it.
>> + * allowed while accounting for the possibility the
>> + * task is pinned to a subset of CPUs. If there is a
>> + * real need of migration, periodic load balance will
>> + * take care of it.
>> */
>> - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
>> + imb = sd->imb_numa_nr;
>> + if (p->nr_cpus_allowed != num_online_cpus()) {
>> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>> +
>> + cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
>> + imb = min(cpumask_weight(cpus), sd->imb_numa_nr);
>> + }
>> + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, imb))
>> return NULL;
>> }
>
> OK, so I've gone collecting patches, and this conflicts with the NUMA
> patches from Mel.
>
> Now, I can (and have) fixed up the conflict,

Thank you for resolving the conflicts.

> but it did make me look at
> this in a little more detail; and the thing I noticed is that your:
>
> 'p->nr_cpus_allowed != num_online_cpus()'
>
> test makes no sense. That's basically 'true'. The thing is,
> nr_cpus_allowed is initialized to NR_CPUS, and unless someone somewhere
> did set_cpus_allowed() on it, it'll still be NR_CPUS.
>
> Also, CPU hotplug doesn't change nr_cpus_allowed, so num_online_cpus()
> is just plain wrong.

I agree this is true. If we offline CPUs, the p->nr_cpus_allowed of
already running tasks will remain the same, but the wakeup path through
find_idlest_group() is only traversed during initial placement and,
from what I understand, nr_cpus_allowed at fork will be equal to, or
less than, num_online_cpus() unless there is a race with CPU hotplug.

To verify this, I added the below debug patch to 5.19.0-rc2
tip/sched/core at commit f3dd3f674555 ("sched: Remove the limitation
of WF_ON_CPU on wakelist if wakee cpu is idle") to check the values
of p->nr_cpus_allowed, num_online_cpus() and NR_CPUS when we are in
find_idlest_group():

--
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8bed75757e65..596d45d148b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9169,6 +9169,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
break;

case group_has_spare:
+ trace_printk("%s: %u %u %u\n", p->comm, p->nr_cpus_allowed, num_online_cpus(), NR_CPUS);
#ifdef CONFIG_NUMA
if (sd->flags & SD_NUMA) {
int imb_numa_nr = sd->imb_numa_nr;
--

Testing was done on a single-socket 64C/128T Zen3 machine.
The configuration consists of 4 sched domains (SMT, MC, NUMA1, NUMA2)
and hence the same log appears 4 times as we traverse down the
hierarchy in find_idlest_cpu(), which calls find_idlest_group() at
each level. Following are the commands run along with the relevant
logs:

# _-----------> p->nr_cpus_allowed
# / _-------> num_online_cpus()
# | / _---> NR_CPUS
# | | /
# | | |
ls-2750 [006] d..1. 97.486667: find_idlest_group: bash: 128 128 8192
ls-2750 [006] d..1. 97.486669: find_idlest_group: bash: 128 128 8192
ls-2750 [006] d..1. 97.486670: find_idlest_group: bash: 128 128 8192
ls-2750 [006] d..1. 97.486671: find_idlest_group: bash: 128 128 8192
# echo 0 > /sys/devices/system/cpu/cpu127/online
# ls
ls-2753 [070] d..1. 103.661735: find_idlest_group: bash: 127 127 8192
ls-2753 [070] d..1. 103.661737: find_idlest_group: bash: 127 127 8192
ls-2753 [070] d..1. 103.661739: find_idlest_group: bash: 127 127 8192
ls-2753 [070] d..1. 103.661740: find_idlest_group: bash: 127 127 8192
# echo 0 > /sys/devices/system/cpu/cpu126/online
# ls
ls-2757 [070] d..1. 109.111868: find_idlest_group: bash: 126 126 8192
ls-2757 [070] d..1. 109.111871: find_idlest_group: bash: 126 126 8192
ls-2757 [070] d..1. 109.111872: find_idlest_group: bash: 126 126 8192
ls-2757 [070] d..1. 109.111873: find_idlest_group: bash: 126 126 8192
# echo 0 > /sys/devices/system/cpu/cpu125/online
# ls
ls-2760 [070] d..1. 113.816679: find_idlest_group: bash: 125 125 8192
ls-2760 [070] d..1. 113.816682: find_idlest_group: bash: 125 125 8192
ls-2760 [070] d..1. 113.816683: find_idlest_group: bash: 125 125 8192
ls-2760 [070] d..1. 113.816684: find_idlest_group: bash: 125 125 8192
# echo 0 > /sys/devices/system/cpu/cpu124/online
# ls
ls-2763 [066] d..1. 118.288481: find_idlest_group: bash: 124 124 8192
ls-2763 [066] d..1. 118.288483: find_idlest_group: bash: 124 124 8192
ls-2763 [066] d..1. 118.288485: find_idlest_group: bash: 124 124 8192
ls-2763 [066] d..1. 118.288485: find_idlest_group: bash: 124 124 8192
# taskset -c 0-127 ls
ls-2765 [066] d..1. 133.659570: find_idlest_group: bash: 124 124 8192
ls-2765 [066] d..1. 133.659572: find_idlest_group: bash: 124 124 8192
ls-2765 [066] d..1. 133.659573: find_idlest_group: bash: 124 124 8192
ls-2765 [066] d..1. 133.659574: find_idlest_group: bash: 124 124 8192
ls-2765 [066] d..1. 133.660128: find_idlest_group: taskset: 124 124 8192
ls-2765 [066] d..1. 133.660130: find_idlest_group: taskset: 124 124 8192
ls-2765 [066] d..1. 133.660132: find_idlest_group: taskset: 124 124 8192
ls-2765 [066] d..1. 133.660132: find_idlest_group: taskset: 124 124 8192
# taskset -c 0-120 ls
ls-2766 [066] d..1. 139.525659: find_idlest_group: bash: 124 124 8192
ls-2766 [066] d..1. 139.525661: find_idlest_group: bash: 124 124 8192
ls-2766 [066] d..1. 139.525662: find_idlest_group: bash: 124 124 8192
ls-2766 [066] d..1. 139.525663: find_idlest_group: bash: 124 124 8192
ls-2766 [066] d..1. 139.526283: find_idlest_group: taskset: 121 124 8192
ls-2766 [066] d..1. 139.526285: find_idlest_group: taskset: 121 124 8192
ls-2766 [066] d..1. 139.526287: find_idlest_group: taskset: 121 124 8192
ls-2766 [066] d..1. 139.526288: find_idlest_group: taskset: 121 124 8192

From the above logs, we can see:
o p->nr_cpus_allowed is equal to the value of num_online_cpus() at fork.
o p->nr_cpus_allowed of a new task and num_online_cpus() change with
CPU hotplug activity.
o taskset with a mask containing offlined CPUs will only count the
online CPUs towards p->nr_cpus_allowed.
o NR_CPUS remains the same at 8192 despite the hotplug activity.

Following are the relevant lines from my .config for NR_CPUS:

CONFIG_NR_CPUS_RANGE_BEGIN=8192
CONFIG_NR_CPUS_RANGE_END=8192
CONFIG_NR_CPUS_DEFAULT=8192
CONFIG_NR_CPUS=8192

When task placement races with CPU hotplug activity, there might be a
difference between p->nr_cpus_allowed and num_online_cpus(), as shown
in the logs below:

# _-----------> p->nr_cpus_allowed
# / _-------> num_online_cpus()
# | / _---> NR_CPUS
# | | /
# | | |
<...>-2863 [071] d..1. 1172.474901: find_idlest_group: bash: 124 124 8192
<...>-2863 [071] d..1. 1172.474904: find_idlest_group: bash: 124 124 8192
<...>-2863 [071] d..1. 1172.474906: find_idlest_group: bash: 124 124 8192
<...>-2863 [071] d..1. 1172.474907: find_idlest_group: bash: 124 124 8192
# systemd-udevd is waking up
# echo 0 > /sys/devices/system/cpu/cpu123/online
# systemd-udevd reaches find_idlest_group()
<...>-1612 [059] d..1. 1172.515209: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515210: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515212: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515213: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515449: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515450: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515452: find_idlest_group: systemd-udevd: 124 123 8192
<...>-1612 [059] d..1. 1172.515452: find_idlest_group: systemd-udevd: 124 123 8192
<...>-2863 [071] d..1. 1172.884914: find_idlest_group: bash: 123 123 8192
<...>-2863 [071] d..1. 1172.884916: find_idlest_group: bash: 123 123 8192
<...>-2863 [071] d..1. 1172.884918: find_idlest_group: bash: 123 123 8192
<...>-2863 [071] d..1. 1172.884919: find_idlest_group: bash: 123 123 8192

Here, the systemd-udevd process has p->nr_cpus_allowed set to 124
during fork, but by the time it reaches find_idlest_group(),
CPU 123 has gone offline and hence num_online_cpus() reports 123.

>
> Now, something that might work is:
>
> 'p->nr_cpus_allowed < num_online_cpus()'
>

This condition can help the case above, where the initial task
placement races with CPU offlining. However, if a CPU comes online
instead (e.g., the task forks with 124 CPUs allowed and a 125th CPU
comes online, making 124 < 125 true), the condition will also hold
and we'll end up performing the cpumask operation even though the
task is not really pinned. Hence, as it stands,
"p->nr_cpus_allowed != num_online_cpus()" should be a good enough
check for most cases.
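
To illustrate how the two checks behave across the scenarios discussed
here, below is a hypothetical userspace walk-through of the predicates
using the CPU counts from the traces above (124 CPUs allowed at fork);
it only models the comparisons, it is not kernel code:

#include <stdio.h>

int main(void)
{
	/* (nr_cpus_allowed at fork, num_online_cpus() at wakeup) */
	struct { const char *what; int allowed, online; } cases[] = {
		{ "no hotplug race",  124, 124 },
		{ "CPU went offline", 124, 123 },
		{ "CPU came online",  124, 125 },
	};

	for (int i = 0; i < 3; i++)
		printf("%-16s: '!=' fires: %d, '<' fires: %d\n",
		       cases[i].what,
		       cases[i].allowed != cases[i].online,
		       cases[i].allowed <  cases[i].online);
	return 0;
}

Both checks fire spuriously when a CPU comes online; only '<' avoids
the spurious cpumask operation when a CPU goes offline.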

> And even that is no guarantee. You can construct a situation where this
> is still false even though you actually have a constrained set.
> Consider a machine with 8 CPUs. Then set the mask to: 0x55, which has 4
> CPUs set. Then offline the last 4 so that the online mask becomes 0x0f.
>
> Then the effective mask is 0x05, and the number we're looking for above
> is 2, but the suggested test would still be false, because
> nr_cpus_allowed would be 4, as would num_online_cpus().

If we use taskset to set a new task's affinity to 0x55 while an
offline operation that takes down CPUs 4-7 is concurrently in
progress, and the new task's creation gets delayed, we will end up
with p->nr_cpus_allowed == 4 and num_online_cpus() == 4, but
p->cpus_ptr is not a subset of cpu_online_mask.

However, given that CPU offline operations are rare in practice, and
that the path through find_idlest_group() will most likely be
traversed only during initial placement, such a scenario is extremely
unlikely.
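
As a hypothetical userspace model of the counterexample above (plain
unsigned ints standing in for struct cpumask, __builtin_popcount()
standing in for cpumask_weight()):

#include <stdio.h>

int main(void)
{
	unsigned int allowed   = 0x55;             /* CPUs 0,2,4,6 */
	unsigned int online    = 0x0f;             /* CPUs 4-7 offlined */
	unsigned int effective = allowed & online; /* 0x05: CPUs 0,2 */

	/* Both counts are 4, so neither '!=' nor '<' detects pinning... */
	printf("nr_cpus_allowed=%d num_online_cpus=%d\n",
	       __builtin_popcount(allowed), __builtin_popcount(online));

	/* ...yet the weight the imbalance logic should see is only 2. */
	printf("effective weight=%d\n", __builtin_popcount(effective));
	return 0;
}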

>
> Find below what I've made of it (on top of Mel's patches), but I'm not
> sure this is what we want. For now I'll leave it commented out.
>
> ---
> Subject: sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()
> From: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Date: Thu, 7 Apr 2022 16:42:22 +0530
>
> From: K Prateek Nayak <kprateek.nayak@xxxxxxx>
>
> In the case of systems containing multiple LLCs per socket, like
> AMD Zen systems, users want to spread bandwidth hungry applications
> across multiple LLCs. Stream is one such representative workload where
> the best performance is obtained by limiting one stream thread per LLC.
> To ensure this, users are known to pin the tasks to a specific subset
> of the CPUs consisting of one CPU per LLC while running such bandwidth
> hungry tasks.
>
> Suppose we kickstart a multi-threaded task like stream with 8 threads
> using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
> server where each socket contains 128 CPUs
> (0-63,128-191 in one socket, 64-127,192-255 in another socket)
>
> Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
> Here each CPU in the list is from a different LLC and 4 of those LLCs
> are on one socket, while the other 4 are on another socket.
>
> Ideally we would prefer that each stream thread runs on a different
> CPU from the allowed list of CPUs. However, the current heuristics in
> find_idlest_group() do not allow this during the initial placement.
>
> Suppose the first socket (0-63,128-191) is our local group from which
> we are kickstarting the stream tasks. The first four stream threads
> will be placed in this socket. When it comes to placing the 5th
> thread, all the allowed CPUs from the local group (0,16,32,48)
> would have been taken.
>
> However, the current scheduler code simply checks if the number of
> tasks in the local group is fewer than the allowed numa-imbalance
> threshold. This threshold was previously 25% of the NUMA domain span
> (in this case threshold = 32), but after v6 of Mel's patchset
> "Adjust NUMA imbalance for multiple LLCs" was merged in sched-tip as
> commit e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
> when SD_NUMA spans multiple LLCs"), it is now equal to the number of
> LLCs in the NUMA domain for processors with multiple LLCs
> (in this case threshold = 8).
>
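[ Aside: the threshold check described above reduces to something
  like the sketch below; this is a simplified restatement based on
  this changelog's description, not necessarily the exact helper in
  tip: ]

/* Allow the task to stay local while the local group's running count
 * is within the per-domain threshold (one task per LLC in the NUMA
 * domain after commit e496132ebedd).
 */
static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
{
	return running <= imb_numa_nr;
}
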
> For this example, the number of tasks will always be within threshold
> and thus all the 8 stream threads will be woken up on the first socket
> thereby resulting in sub-optimal performance.
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks in the current tip/sched/core on the Zen3 machine:
>
> stream-5313 [016] d..2. 627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
> stream-5313 [016] d..2. 627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
> stream-5313 [016] d..2. 627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
> stream-5313 [016] d..2. 627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
> stream-5313 [016] d..2. 627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
> stream-5313 [016] d..2. 627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
> stream-5313 [016] d..2. 627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016
>
> Once the first four threads are distributed among the allowed CPUs of
> socket one, the rest of the threads start piling on these same CPUs
> when clearly there are CPUs on the second socket that can be used.
>
> Following the initial pile-up on a small number of CPUs, the
> load-balancer eventually kicks in, but it takes a while to get to
> {4}{4}, and even {4}{4} isn't stable, as we observe ping-ponging
> between {4}{4} and {5}{3} before a stable state is reached much
> later (1 stream thread per allowed CPU) and no more migration is
> required.
>
> We can detect this piling and avoid it by checking if the number of
> allowed CPUs in the local group is fewer than the number of tasks
> running in the local group, and use this information to spread the
> 5th task out into the next socket (after all, the goal in this
> slowpath is to find the idlest group and the idlest CPU during the
> initial placement!).
>
> The following sched_wakeup_new tracepoint output shows the initial
> placement of tasks after adding this fix on the Zen3 machine:
>
> stream-4485 [016] d..2. 230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
> stream-4485 [016] d..2. 230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
> stream-4485 [016] d..2. 230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
> stream-4485 [016] d..2. 230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
> stream-4485 [016] d..2. 230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
> stream-4485 [016] d..2. 230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
> stream-4485 [016] d..2. 230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064
>
> We see that threads are using all of the allowed CPUs and there is
> no pileup.
>
> No output is generated for tracepoint sched_migrate_task with this
> patch due to a perfect initial placement which removes the need
> for balancing later on - both across NUMA boundaries and within
> NUMA boundaries for stream.
>
> Following are the results from running 8 Stream threads with and
> without pinning on a dual socket Zen3 Machine (2 x 64C/128T):
>
> During the testing of this patch, the tip sched/core was at
> commit: 089c02ae2771 "ftrace: Use preemption model accessors for trace
> header printout"
>
> Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
>
>           5.18.0-rc1             5.18.0-rc1             5.18.0-rc1
>         tip sched/core         tip sched/core         tip sched/core
>          (no pinning)            + pinning             + this-patch
>                                                         + pinning
>
> Copy:   109364.74 (0.00 pct)   94220.50 (-13.84 pct)  158301.28 (44.74 pct)
> Scale:  109670.26 (0.00 pct)   90210.59 (-17.74 pct)  149525.64 (36.34 pct)
> Add:    129029.01 (0.00 pct)  101906.00 (-21.02 pct)  186658.17 (44.66 pct)
> Triad:  127260.05 (0.00 pct)  106051.36 (-16.66 pct)  184327.30 (44.84 pct)
>
> Pinning currently hurts performance compared to the unbound case on
> tip/sched/core. With the addition of this patch, we are able to
> outperform tip/sched/core by a good margin with pinning.
>
> Following are the results from running 16 Stream threads with and
> without pinning on a dual socket IceLake Machine (2 x 32C/64T):
>
> NUMA Topology of the Intel Icelake machine:
> Node 1: 0,2,4,6 ... 126 (Even numbers)
> Node 2: 1,3,5,7 ... 127 (Odd numbers)
>
> Pinning is done using: numactl -C 0-15 ./stream16
>
>           5.18.0-rc1             5.18.0-rc1             5.18.0-rc1
>         tip sched/core         tip sched/core         tip sched/core
>          (no pinning)            + pinning             + this-patch
>                                                         + pinning
>
> Copy:    85815.31 (0.00 pct)  149819.21 (74.58 pct)   156807.48 (82.72 pct)
> Scale:   64795.60 (0.00 pct)   97595.07 (50.61 pct)    99871.96 (54.13 pct)
> Add:     71340.68 (0.00 pct)  111549.10 (56.36 pct)   114598.33 (60.63 pct)
> Triad:   68890.97 (0.00 pct)  111635.16 (62.04 pct)   114589.24 (66.33 pct)
>
> In the case of the Icelake machine, which has a single LLC per socket,
> pinning across the two sockets reduces cache contention, thus showing a
> great improvement in the pinned case, which is further benefited by
> this patch.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Reviewed-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Reviewed-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
> Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Link: https://lkml.kernel.org/r/20220407111222.22649-1-kprateek.nayak@xxxxxxx
> ---
> kernel/sched/fair.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9210,6 +9210,7 @@ find_idlest_group(struct sched_domain *s
> case group_has_spare:
> #ifdef CONFIG_NUMA
> if (sd->flags & SD_NUMA) {
> + int imb_numa_nr = sd->imb_numa_nr;
> #ifdef CONFIG_NUMA_BALANCING
> int idlest_cpu;
> /*
> @@ -9227,13 +9228,22 @@ find_idlest_group(struct sched_domain *s
> * Otherwise, keep the task close to the wakeup source
> * and improve locality if the number of running tasks
> * would remain below threshold where an imbalance is
> - * allowed. If there is a real need of migration,
> - * periodic load balance will take care of it.
> + * allowed while accounting for the possibility the
> + * task is pinned to a subset of CPUs. If there is a
> + * real need of migration, periodic load balance will
> + * take care of it.
> */
> + if (p->nr_cpus_allowed < num_online_cpus()) {
> + struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> +
> + cpumask_and(cpus, sched_group_span(local), p->cpus_ptr);
> + imb_numa_nr = min(cpumask_weight(cpus), sd->imb_numa_nr);
> + }
> +
> imbalance = abs(local_sgs.idle_cpus - idlest_sgs.idle_cpus);
> if (!adjust_numa_imbalance(imbalance,
> local_sgs.sum_nr_running + 1,
> - sd->imb_numa_nr)) {
> + imb_numa_nr)) {
> return NULL;
> }
> }

Thank you again for resolving the conflict and picking up
this patch.
--
Thanks and Regards,
Prateek