Topology updates and NUMA-level sched domains

From: Nishanth Aravamudan
Date: Mon Apr 06 2015 - 17:46:10 EST


Hi Peter,

As you are very aware, I think, power has some odd NUMA topologies (and
changes to those topologies) at run-time. In particular, we can see a
topology like this at boot:

Node 0: all CPUs
Node 7: no CPUs

Then we get a notification from the hypervisor that a core (or two) has
moved from node 0 to node 7. This results in the:

[ 64.496687] BUG: arch topology borken
[ 64.496689] the CPU domain not a subset of the NUMA domain

messages for each moved CPU. I think this is because, when we first came
up, we degraded (or elided altogether?) the NUMA domain for node 7, as it
had no CPUs at the time; a rough sketch of the check that fires follows
the boot-time domains below:

[ 0.305823] CPU0 attaching sched-domain:
[ 0.305831] domain 0: span 0-7 level SIBLING
[ 0.305834] groups: 0 (cpu_power = 146) 1 (cpu_power = 146) 2
(cpu_power = 146) 3 (cpu_power = 146) 4 (cpu_power = 146) 5 (cpu_power =
146) 6 (cpu_power = 146) 7 (cpu_power = 146)
[ 0.305854] domain 1: span 0-79 level CPU
[ 0.305856] groups: 0-7 (cpu_power = 1168) 8-15 (cpu_power = 1168)
16-23 (cpu_power = 1168) 24-31 (cpu_power = 1168) 32-39 (cpu_power =
1168) 40-47 (cpu_power = 1168) 48-55 (cpu_power = 1168) 56-63 (cpu_power
= 1168) 64-71 (cpu_power = 1168) 72-79 (cpu_power = 1168)
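
For reference, the warning itself comes from the span subset check in
build_sched_domain() in kernel/sched/core.c. Roughly (quoting from memory,
so the details may differ in this tree) it is:

	cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
	if (child) {
		/* ... child/parent links are set up here ... */
		if (!cpumask_subset(sched_domain_span(child),
				    sched_domain_span(sd))) {
			pr_err("BUG: arch topology borken\n");
#ifdef CONFIG_SCHED_DEBUG
			pr_err("     the %s domain not a subset of the %s domain\n",
					child->name, sd->name);
#endif
			/* Fixup: make sure sd spans at least child's CPUs. */
			cpumask_or(sched_domain_span(sd),
				   sched_domain_span(sd),
				   sched_domain_span(child));
		}
	}

That is, the parent span comes from tl->mask(cpu), and for the NUMA levels
that mask still reflects the boot-time node assignments.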

For the CPUs that moved, we see the following after the update:

[ 64.505819] CPU8 attaching sched-domain:
[ 64.505821] domain 0: span 8-15 level SIBLING
[ 64.505823] groups: 8 (cpu_power = 147) 9 (cpu_power = 147) 10
(cpu_power = 147) 11 (cpu_power = 146) 12 (cpu_power = 147) 13
(cpu_power = 147) 14 (cpu_power = 146) 15 (cpu_power = 147)
[ 64.505842] domain 1: span 8-23,72-79 level CPU
[ 64.505845] groups: 8-15 (cpu_power = 1174) 16-23 (cpu_power =
1175) 72-79 (cpu_power = 1176)

while the unmodified CPUs correctly report:

[ 64.497186] CPU0 attaching sched-domain:
[ 64.497189] domain 0: span 0-7 level SIBLING
[ 64.497192] groups: 0 (cpu_power = 147) 1 (cpu_power = 147) 2
(cpu_power = 146) 3 (cpu_power = 147) 4 (cpu_power = 147) 5 (cpu_power =
147) 6 (cpu_power = 147) 7 (cpu_power = 146)
[ 64.497213] domain 1: span 0-7,24-71 level CPU
[ 64.497215] groups: 0-7 (cpu_power = 1174) 24-31 (cpu_power =
1173) 32-39 (cpu_power = 1176) 40-47 (cpu_power = 1175) 48-55 (cpu_power
= 1176) 56-63 (cpu_power = 1175) 64-71 (cpu_power = 1174)
[ 64.497234] domain 2: span 0-79 level NUMA
[ 64.497236] groups: 0-7,24-71 (cpu_power = 8223) 8-23,72-79
(cpu_power = 3525)
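
Note that the NUMA-level spans above come from masks that sched_init_numa()
computes once, at boot, from node_distance(). As far as I can tell, the
NUMA topology level's mask callback is just a lookup into that boot-time
table (again from memory, so take the exact form with a grain of salt):

	static const struct cpumask *sd_numa_mask(int cpu)
	{
		return sched_domains_numa_masks[sched_domains_curr_level]
					       [cpu_to_node(cpu)];
	}

so after the node assignments change, those masks no longer describe where
the CPUs actually live, and nothing rebuilds them.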

It seems like we might need something like this (a HORRIBLE HACK, I know,
just meant to get discussion going):

@@ -6958,6 +6960,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],

 	/* Let architecture update cpu core mappings. */
 	new_topology = arch_update_cpu_topology();
+	/* Update NUMA topology lists */
+	if (new_topology) {
+		sched_init_numa();
+	}

 	n = doms_new ? ndoms_new : 0;

or a re-init API (one which won't try to reallocate various bits), because
the topology could be completely different now (e.g.,
sched_domains_numa_distance will also be inaccurate). Really, a topology
update on power (not sure about s390x, but those are the only two arches
that return a positive value from arch_update_cpu_topology() right now,
afaics) is a lot like a hotplug event, and we need to re-initialize any
dependent structures.
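
To make the re-init idea a bit more concrete, I'm picturing something along
these lines (an untested, purely hypothetical sketch: the helper name and
the in-place loop are made up for illustration, and it deliberately ignores
the harder problem of the distance table itself changing):

	/*
	 * Hypothetical: refresh the per-level NUMA masks in place after an
	 * arch topology update, reusing the arrays that sched_init_numa()
	 * allocated at boot rather than reallocating them.
	 */
	void sched_refresh_numa_masks(void)
	{
		int level, node, other;

		for (level = 0; level < sched_domains_numa_levels; level++) {
			for (node = 0; node < nr_node_ids; node++) {
				struct cpumask *mask =
					sched_domains_numa_masks[level][node];

				cpumask_clear(mask);
				for (other = 0; other < nr_node_ids; other++) {
					if (node_distance(node, other) >
					    sched_domains_numa_distance[level])
						continue;
					cpumask_or(mask, mask,
						   cpumask_of_node(other));
				}
			}
		}
	}

Even that only refreshes the masks, of course; distances that didn't exist
at boot would still leave the distance table and level count wrong, per the
above.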

I'm just sending out feelers; it seems we can limp by with the warning
above, but that is less than ideal. Any help or insight you could provide
would be greatly appreciated!

-Nish
