RE: [RFC] sched/topology: NUMA topology limitations

From: Song Bao Hua (Barry Song)
Date: Sat Aug 29 2020 - 01:32:41 EST


> Subject: [RFC] sched/topology: NUMA topology limitations
>
> (For those of you who already got this one: sorry! I messed up LKML and
> Vincent's addresses)
>
> Hi,
>
> Some of you may have noticed me struggling to plug some topology holes in
> [1]. While digging in there I realized there are some non-obvious limitations
> wrt supported NUMA topologies; IMO if we don't feel like they should be
> "fixed", they should at the very least be documented somewhere.
>
> I should get to ramble about this at this year's (v)LPC during the scheduler
> µconf, but this isn't really trivial so I figured an email wouldn't hurt. I tried to
> package it into reasonable segments - this is still a pretty big wall of text, so
> use of coffee/tea/beer is recommended.
>
> Spoiler alert: I don't have solutions for these yet.
>
> Diameter issue
> ==============
>
> Definition
> ++++++++++
>
> Each unique NUMA distance gets an index in sched_domains_numa_distance[],
> and as the big wall of text atop topology.c::build_balance_mask() mentions
> this can be seen as an array of hops. i.e. in the following table:
>
>      0   1
>   0: 10  20
>   1: 20  10
>
> sched_domains_numa_distance = { 10, 20 };
>
> distance 10 is 0 hops (i.e. identity distance), distance 20 is 1 hop. This isn't
> strictly speaking always true, as e.g. two consecutive distance values could
> represent the same number of hops (with one path slightly "longer" than the
> other), but it makes explaining that mess a bit easier.
>
> I'm using 'diameter' in the NoC meaning of the term, i.e. the longest shortest
> path between any two nodes, in # of hops.
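
Just to check that I read the hop/diameter definitions the same way, here is a
quick userspace sketch (my own toy code, not anything from topology.c) that
derives hop counts from a SLIT-style table and takes the longest shortest path
as the diameter. It assumes the smallest remote distance in the table marks a
direct (1-hop) link, and uses the 4-node "segmented bus" table from your case
study further down as input; it prints diameter = 3.

    /*
     * Userspace sketch (not kernel code): compute hop counts and the
     * diameter from a SLIT-style distance table. Assumption: the smallest
     * remote distance marks a direct link; longer paths are chained from
     * those links via Floyd-Warshall.
     */
    #include <limits.h>
    #include <stdio.h>

    #define N 4

    /* The "segmented bus" table from the QEMU case study. */
    static const int dist[N][N] = {
        { 10, 20, 30, 40 },
        { 20, 10, 20, 30 },
        { 30, 20, 10, 20 },
        { 40, 30, 20, 10 },
    };

    int main(void)
    {
        int hops[N][N], i, j, k, min_remote = INT_MAX, diameter = 0;

        /* Smallest remote distance == direct link (assumption, see above). */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                if (i != j && dist[i][j] < min_remote)
                    min_remote = dist[i][j];

        /* 0 hops to self, 1 hop over a direct link, "unknown" (N) otherwise. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                hops[i][j] = (i == j) ? 0 :
                             (dist[i][j] == min_remote) ? 1 : N;

        /* All-pairs shortest paths, counted in hops. */
        for (k = 0; k < N; k++)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    if (hops[i][k] + hops[k][j] < hops[i][j])
                        hops[i][j] = hops[i][k] + hops[k][j];

        /* Diameter = longest shortest path. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                if (hops[i][j] > diameter)
                    diameter = hops[i][j];

        printf("diameter = %d hops\n", diameter);    /* 3 for this table */
        return 0;
    }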
>
> Diameter <= 2
> +++++++++++++
>
> AFAIU, one of the issues I'm encountering in [1] is that the scheduler topology
> code cannot cope with a NUMA diameter > 2. I've looked into what's already
> out there, but haven't found any (x86) machine that violates this rule. Below
> are some examples.
>
> AMD Epyc
> --------
>
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  16  32  32  32  32
>   1:  16  10  16  16  32  32  32  32
>   2:  16  16  10  16  32  32  32  32
>   3:  16  16  16  10  32  32  32  32
>   4:  32  32  32  32  10  16  16  16
>   5:  32  32  32  32  16  10  16  16
>   6:  32  32  32  32  16  16  10  16
>   7:  32  32  32  32  16  16  16  10
>
> Here, diameter=2
>
> Opteron 6174
> ------------
>
> This is the 4-node one, which looks like this:
>
> node   0   1   2   3
>   0:  10  20  20  20
>   1:  20  10  20  20
>   2:  20  20  10  20
>   3:  20  20  20  10
>
> 0 ----- 3
> | \   / |
> |   X   |
> | /   \ |
> 1 ----- 2
>
> Clearly this is diameter=1
>
> AMD Bulldozer
> -------------
>
> I'm using the topology described in decade of wasted cores [2]; I'm not going
> to bother reproducing the distance table, the topology looks like:
>
>  _______________________
> /        ______________ \
> |       /              \|
> 0 ----- 4 ----- 5 ----- 1
> | \   / | \   / | \   / |
> |   X   |   X   |   X   |
> | /   \ | /   \ | /   \ |
> 6 ----- 2 ----- 3 ----- 7
> |        \______________/|
>  \_______________________/
>
> Which is diameter=2 (assuming equal edge weights)
>
> Diameter > 2
> ++++++++++++
>
> I haven't found any NUMA system out there with a diameter > 2 other than the
> one I'm rambling about in [1]. Nevertheless, I think there are some
> theoretical setups that aren't too insane (at least less so than [1]).
>
> One of them is a 6-node mesh, which is diameter=3 and would look like:
>
> 0 ----- 5
> |       |
> 1 ----- 4
> |       |
> 2 ----- 3
>
> A more convoluted one would be a 12-node "spidergon"; it's a ring with an
> even number of nodes N, each node is connected to both of its neighbours and
> there is an interlink between node X and node (X + N/2) % N. I won't pretend I
> can draw it in ASCII, but that again gives us a diameter of 3.
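
Out of curiosity I also checked the spidergon claim with the same hop-counting
approach; again this is only my own userspace sketch, with the ring/interlink
rule taken from your description, and it does come out as diameter 3 for N=12:

    /*
     * Sketch: hop matrix for an N-node "spidergon" -- each node X links to
     * its two ring neighbours and to node (X + N/2) % N across the ring.
     * Same Floyd-Warshall hop counting as before.
     */
    #include <stdio.h>

    #define N 12

    int main(void)
    {
        int hops[N][N], i, j, k, diameter = 0;

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                hops[i][j] = (i == j) ? 0 : N;    /* N acts as "unreachable" */

        for (i = 0; i < N; i++) {
            hops[i][(i + 1) % N]     = 1;         /* ring neighbours */
            hops[i][(i + N - 1) % N] = 1;
            hops[i][(i + N / 2) % N] = 1;         /* cross-ring interlink */
        }

        for (k = 0; k < N; k++)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    if (hops[i][k] + hops[k][j] < hops[i][j])
                        hops[i][j] = hops[i][k] + hops[k][j];

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                if (hops[i][j] > diameter)
                    diameter = hops[i][j];

        printf("spidergon N=%d: diameter = %d hops\n", N, diameter);
        return 0;
    }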
>
> Case study w/ QEMU
> ++++++++++++++++++
>
> Setup
> -----
>
> The topology I'll describe below is a simplified version of the one in [1], as it is
> the smallest reproducer of the issue. Morten has been calling this a
> "segmented bus".
>
> 1 ----- 0
> |
> |
> |
> 2 ----- 3
>
> node   0   1   2   3
>   0:  10  20  30  40
>   1:  20  10  20  30
>   2:  30  20  10  20
>   3:  40  30  20  10
>
> Diameter here is 3. I can recreate this with the following qemu
> incantation:
>
> $ qemu-system-aarch64 -kernel Image -initrd rootfs.cpio -append \
>     'console=ttyAMA0 earlycon=pl011,0x9000000 root=/dev/vda debug earlyprintk=serial sched_debug' \
>     -smp cores=4 -nographic -m 512 -cpu cortex-a53 -machine virt \
>     -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
>     -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
>     -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
>     -numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
>     -numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20
>
> If you boot the above with CONFIG_SCHED_DEBUG, you'll get:
>
> [ 0.245896] CPU0 attaching sched-domain(s):
> [ 0.246133] domain-0: span=0-1 level=NUMA
> [ 0.246592] groups: 0:{ span=0 cap=1011 }, 1:{ span=1 cap=1008 }
> [ 0.246998] domain-1: span=0-2 level=NUMA
> [ 0.247145] groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3 mask=2 cap=3025 }
> [ 0.247454] ERROR: groups don't span domain->span

Hi Valentin,
Thanks for your email; the explanation is quite clear. May I ask what the real
harm is when group 2 actually reaches outside the 2-hop span here? What will
happen when group2 doesn't span cpu0_domain1->span?
In domain-1, will the scheduler fail to load-balance between group0 and group2?

> [ 0.247654] domain-2: span=0-3 level=NUMA
> [ 0.247892] groups: 0:{ span=0-2 mask=0 cap=3021 }, 3:{ span=1-3 mask=3 cap=3047 }

Here domain-2 covers the whole span 0-3, so does that mean we can still
load-balance between cpu0 and cpu3 in domain-2 even though domain-1 fails?

>
> [ 0.248915] CPU1 attaching sched-domain(s):
> [ 0.249050] domain-0: span=0-2 level=NUMA
> [ 0.249181] groups: 1:{ span=1 cap=1008 }, 2:{ span=2 cap=1002 }, 0:{ span=0 cap=1011 }
> [ 0.249661] domain-1: span=0-3 level=NUMA
> [ 0.249810] groups: 1:{ span=0-2 mask=1 cap=3034 }, 3:{ span=2-3 mask=3 cap=2023 }
>
> [ 0.250381] CPU2 attaching sched-domain(s):
> [ 0.250565] domain-0: span=1-3 level=NUMA
> [ 0.250710] groups: 2:{ span=2 cap=1002 }, 3:{ span=3 cap=999 }, 1:{ span=1 cap=1008 }
> [ 0.250992] domain-1: span=0-3 level=NUMA
> [ 0.251129] groups: 2:{ span=1-3 mask=2 cap=3025 }, 0:{ span=0-1 mask=0 cap=2019 }
>
> [ 0.251474] CPU3 attaching sched-domain(s):
> [ 0.251606] domain-0: span=2-3 level=NUMA
> [ 0.251772] groups: 3:{ span=3 cap=999 }, 2:{ span=2 cap=1002 }
> [ 0.252439] domain-1: span=1-3 level=NUMA
> [ 0.252587] groups: 3:{ span=2-3 mask=3 cap=2023 }, 1:{ span=0-2 mask=1 cap=3034 }
> [ 0.252859] ERROR: groups don't span domain->span
> [ 0.253009] domain-2: span=0-3 level=NUMA
> [ 0.253153] groups: 3:{ span=1-3 mask=3 cap=3047 }, 0:{ span=0-2 mask=0 cap=3021 }
>
> Why we get this
> ---------------
>
> The sched_domains we get here are NUMA<=20, NUMA<=30, NUMA<=40.
>
> There's no problem with the sched_domain spans, those are directly derived
> from sched_domains_numa_masks[][] and that works just fine. We then build
> the sched_groups using those spans, see
>
> build_overlap_sched_groups()
>
> Let's have a look at what happens for the first domain on CPU0 (node0) and
> CPU2 (node2). This would be NUMA<=20 (i.e. 1 hop), whose groups are the
> spanned CPUs:
>
> CPU0 (node0)
> NUMA<=20 span=[0-1], groups=(0), (1)
>
> CPU2 (node2)
> NUMA<=20 span=[1-3], groups=(2), (3), (1)
>
> So far so good, nothing outlandish here. However, when we get to building
> NUMA<=30 (i.e. 2 hops), this falls apart:
>
> CPU0
> NUMA<=30 span=[0-2], groups=(0-1), (1-3) <<< sched_domain_span(CPU2, NUMA<=20)
>                              ^^^
>                              sched_domain_span(CPU0, NUMA<=20)
>
> CPU3 (node3) is very much *not* <= 30 distance (2 hops) away from CPU0
> (node0), but we have a group that spans it anyway, because of the
> aforementioned code. IOW, we are creating a completely bogus transitivity
> relation that sort of looks like:
>
> For any nodes i, j, k, and hop count x > 1:
> node_distance(i, j) <= sched_domains_numa_distance[x]
> AND
> node_distance(j, k) <= sched_domains_numa_distance[x-1]
> IMPLIES
> node_distance(i, k) <= sched_domains_numa_distance[x]
>
> Which becomes in our example:
>
> node_distance(0, 2) <= sched_domains_numa_distance[2] # 30
> AND
> node_distance(2, 3) <= sched_domains_numa_distance[1] # 20
> IMPLIES (bogusly!)
> node_distance(0, 3) <= sched_domains_numa_distance[2] # 30
>
> This actually works for diameters == 2 (given everything is covered by 2 hops
> anyway) but falls apart for bigger diameters, as is the case here.
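
To make sure I follow the mechanism, I put together a small toy model of the
construction you describe. This is plain userspace C and not the real
build_overlap_sched_groups() -- span(), numa_dist[] and the bitmask
bookkeeping are my own -- but it reproduces the escaping group for the
segmented-bus table:

    /*
     * Toy model (not the actual topology.c code): a CPU's level-x domain
     * spans every node within numa_dist[x], and its groups are the
     * level-(x-1) spans of the CPUs it covers, skipping CPUs that are
     * already covered. numa_dist[] here plays the role of
     * sched_domains_numa_distance[].
     */
    #include <stdio.h>

    #define N 4

    static const int dist[N][N] = {
        { 10, 20, 30, 40 },
        { 20, 10, 20, 30 },
        { 30, 20, 10, 20 },
        { 40, 30, 20, 10 },
    };

    static const int numa_dist[] = { 10, 20, 30, 40 };

    /* Bitmask of nodes within numa_dist[level] of 'cpu'. */
    static unsigned int span(int cpu, int level)
    {
        unsigned int mask = 0;
        int j;

        for (j = 0; j < N; j++)
            if (dist[cpu][j] <= numa_dist[level])
                mask |= 1u << j;
        return mask;
    }

    int main(void)
    {
        int cpu = 0, level = 2;                 /* CPU0, NUMA<=30 */
        unsigned int dom = span(cpu, level);
        unsigned int covered = 0, grp;
        int j;

        printf("CPU%d NUMA<=%d domain span: 0x%x\n", cpu, numa_dist[level], dom);

        for (j = 0; j < N; j++) {
            if (!(dom & (1u << j)) || (covered & (1u << j)))
                continue;
            grp = span(j, level - 1);           /* child-level span of CPU j */
            covered |= grp;
            printf("  group built from CPU%d: 0x%x%s\n", j, grp,
                   (grp & ~dom) ? "  <-- escapes the domain span" : "");
        }
        return 0;
    }

That matches the 0:{ span=0-1 } and 2:{ span=1-3 } groups and the "groups
don't span domain->span" error in your dmesg above.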
>
> Implications of fixing this
> ---------------------------
>
> Clearly the current sched_group setup for such topologies is not what we
> want: this disturbs load balancing on the 'corrupted' domains.
>
> If we *do* want to support systems like this, then we have other problems to
> solve. Currently, on the aforementioned QEMU setup, we have:
>
> CPU0-domain1
> group0: span=0-2, mask=0
> group2: span=1-3, mask=2

Your kernel log is:
[ 0.246998] domain-1: span=0-2 level=NUMA
[ 0.247145] groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3 mask=2 cap=3025 }

so it seems group0 should be:
group0: span=0-1, mask=0

rather than:
group0: span=0-2, mask=0

Is that a typo?

> CPU1-domain1
> group1: span=0-2, mask=1
> group3: span=2-3, mask=3
> CPU2-domain1
> group2: span=1-3, mask=2
> group0: span=0-1, mask=0
> CPU3-domain1
> group3: span=2-3, mask=3
> group1: span=0-2, mask=1
>
> We would want to "fix" this into:
>
> CPU0-domain1
> group0: span=0-2, mask=0
> - group2: span=1-3, mask=2
> + group?: span=1-2, mask=??

It seems sensible to move CPU3 out of the group spans of cpu0-domain1.

> CPU1-domain1
> group1: span=0-2, mask=1
> group3: span=2-3, mask=3
> CPU2-domain1
> group2: span=1-3, mask=2
> group0: span=0-1, mask=0
> CPU3-domain1
> group3: span=2-3, mask=3
> - group1: span=0-2, mask=1
> + group?: span=1-2, mask=??

Again, it seems sensible to move CPU0 out of the group spans of cpu3-domain1.
>
> I've purposely left a blank for the balance mask, because there isn't really any
> valid choice. No CPU will have this (1-2) group as its local group at domain1
> (NUMA<=30); worse than that, we now actually have 5 unique groups at this
> topology level despite having only 4 CPUs: there are no CPUs left to use as
> identifier of that "new" group (i.e. owner of sched_group_capacity).
>
> Even if we find some form of unique identifier for that group, it still isn't the
> local group of any CPU at any topology level. Since
> update_group_capacity() only updates the capacity of the local group, we
> don't have any way to update the capacity of such groups - unless we start
> updating more than the local group, which will get ugly real quick.
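
Trying to restate the capacity problem to myself with a toy model (again, not
the real update_group_capacity() path; the group numbering and names below are
hypothetical): if each CPU only ever refreshes its own local group, a group
that is local to nobody never gets refreshed at all:

    /*
     * Toy model (not kernel code) of the capacity-update problem: each CPU
     * periodically refreshes only its local group. With the hypothetical
     * extra (1-2) group at NUMA<=30 there are 5 unique groups but only 4
     * CPUs, and the (1-2) group is local to none of them.
     */
    #include <stdio.h>

    #define NR_CPUS    4
    #define NR_GROUPS  5    /* (0-1), (0-2), (1-3), (2-3) and the new (1-2) */

    static const char *group_name[NR_GROUPS] = {
        "(0-1)", "(0-2)", "(1-3)", "(2-3)", "(1-2)"
    };

    /* Local group of each CPU at the NUMA<=30 level, taken from the dmesg. */
    static const int local_group[NR_CPUS] = { 0, 1, 2, 3 };

    static unsigned int refresh_count[NR_GROUPS];

    static void periodic_update(void)
    {
        int cpu;

        /* Stand-in for "update the capacity of the local group only". */
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            refresh_count[local_group[cpu]]++;
    }

    int main(void)
    {
        int g;

        periodic_update();
        for (g = 0; g < NR_GROUPS; g++)
            printf("group %s refreshed %u time(s)%s\n", group_name[g],
                   refresh_count[g],
                   refresh_count[g] ? "" : "  <-- never refreshed");
        return 0;
    }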
>
>
> That's as far as I got to. If you got to here, thank you for your time, and as
> always, comments / feedback are most welcome.

What is the real harm you have seen in load balancing? Have you tried making
cpu0, cpu1 and cpu2 busy and then waking up more tasks on cpu0? Will cpu3
join in to share the load in your qemu topology?

>
> Links
> =====
>
> [1]: https://lkml.kernel.org/r/20200324125533.17447-1-valentin.schneider@arm.com/
> [2]: https://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
>

Thanks
Barry