Re: [RFC 2/3] sched/topology: fix sched groups on NUMA machines with mesh topology

From: Lauro Venancio
Date: Thu Apr 13 2017 - 16:21:11 EST


On 04/13/2017 12:48 PM, Peter Zijlstra wrote:
> On Thu, Apr 13, 2017 at 10:56:08AM -0300, Lauro Ramos Venancio wrote:
>> Currently, on a 4 nodes NUMA machine with ring topology, two sched
>> groups are generated for the last NUMA sched domain. One group has the
>> CPUs from NUMA nodes 3, 0 and 1; the other group has the CPUs from nodes
>> 1, 2 and 3. As CPUs from nodes 1 and 3 belong to both groups, the
>> scheduler is unable to directly move tasks between these nodes. In the
>> worst scenario, when a set of tasks are bound to nodes 1 and 3, the
>> performance is severely impacted because just one node is used while the
>> other node remains idle.
> I feel a picture would be ever so much clearer.
>
>> This patch constructs the sched groups from each CPU perspective. So, on
>> a 4 nodes machine with ring topology, while nodes 0 and 2 keep the same
>> groups as before [(3, 0, 1)(1, 2, 3)], nodes 1 and 3 have new groups
>> [(0, 1, 2)(2, 3, 0)]. This allows moving tasks between any node 2-hops
>> apart.
> So I still have no idea what specifically goes wrong and how this fixes
> it. Changelog is impenetrable.
On a 4-node machine with ring topology (nodes connected 0-1-2-3-0), the
last sched domain level contains groups spanning 3 NUMA nodes each, so
there are four possible groups: (0, 1, 2), (1, 2, 3), (2, 3, 0) and
(3, 0, 1). As we need just two groups to fill the sched domain,
currently the groups (3, 0, 1) and (1, 2, 3) are used for all CPUs. The
problem with this is that nodes 1 and 3 belong to both groups, making it
impossible to move tasks directly between these two nodes.

This patch uses different groups depending on the CPU for which they are
installed. CPUs on nodes 0 and 2 keep the same groups as before:
(3, 0, 1) and (1, 2, 3). CPUs on nodes 1 and 3 use the new groups:
(0, 1, 2) and (2, 3, 0). The first pair of groups allows movement
between nodes 0 and 2; the second pair allows movement between nodes 1
and 3.
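
To make this concrete, here is a small user-space toy, not the kernel
code, just a sketch of the group construction described above: it
greedily covers a 4-node ring with the child-level spans, once always
scanning from node 0 (roughly the current behaviour) and once scanning
from the node the groups are installed for (the idea of this patch):

	#include <stdio.h>

	#define NODES 4

	/* Child-level span of a node in a 4-node ring: itself plus both neighbours. */
	static unsigned int span_mask(int n)
	{
		return (1u << n) |
		       (1u << ((n + 1) % NODES)) |
		       (1u << ((n + NODES - 1) % NODES));
	}

	/* Greedily cover the ring with child spans, scanning nodes from 'scan_start'. */
	static void build_groups(int install_node, int scan_start)
	{
		unsigned int covered = 0;
		int k;

		printf("node %d gets groups:", install_node);
		for (k = 0; k < NODES; k++) {
			int n = (scan_start + k) % NODES;

			if (covered & (1u << n))
				continue;
			printf(" (%d, %d, %d)",
			       (n + NODES - 1) % NODES, n, (n + 1) % NODES);
			covered |= span_mask(n);
		}
		printf("\n");
	}

	int main(void)
	{
		int node;

		printf("current: the scan always starts at the first node\n");
		for (node = 0; node < NODES; node++)
			build_groups(node, 0);		/* (3, 0, 1) (1, 2, 3) for everyone */

		printf("\npatched: the scan starts at the node being installed\n");
		for (node = 0; node < NODES; node++)
			build_groups(node, node);	/* first group is the node's own span */

		return 0;
	}

The "patched" output reproduces the groups listed above: nodes 0 and 2
get (3, 0, 1) (1, 2, 3), nodes 1 and 3 get (0, 1, 2) (2, 3, 0).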

I will improve the changelog.

> "From each CPU's persepective" doesn't really help, there already is a
> for_each_cpu() in.
The for_each_cpu() there iterates over all CPUs in the sched domain
span. It doesn't take into account the CPU for which the groups are
being installed (the cpu parameter of build_overlap_sched_groups()).
Currently, the cpu parameter is used only for memory allocation and for
ordering the groups; it doesn't change which groups are chosen. This
patch uses the cpu parameter to choose the first group, which, as a
consequence, also changes the second group.
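
As a toy illustration of the iteration-order point (the loops below are
only simplified stand-ins for the kernel's for_each_cpu(), assuming one
CPU per node): a plain scan of the span visits CPUs in the same order no
matter which CPU the groups are installed for, while a scan that starts
at that CPU and wraps around does not.

	#include <stdio.h>

	#define NR 4	/* pretend one CPU per node */

	int main(void)
	{
		int cpu, i, k;

		for (cpu = 0; cpu < NR; cpu++) {
			printf("installing for cpu %d, plain scan: ", cpu);
			for (i = 0; i < NR; i++)	/* same order for every cpu */
				printf("%d ", i);

			printf("  scan starting at cpu: ");
			for (k = 0; k < NR; k++)	/* wraps, so the first uncovered node differs */
				printf("%d ", (cpu + k) % NR);
			printf("\n");
		}
		return 0;
	}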
>
> Also, since I'm not sure what happened to the 4 node system, I cannot
> begin to imagine what would happen on the 8 node one.