Re: 5.11-rc4+git: Shortest NUMA path spans too many nodes

From: Dietmar Eggemann
Date: Thu Jan 21 2021 - 14:23:34 EST


On 21/01/2021 19:21, Valentin Schneider wrote:
> On 21/01/21 19:39, Meelis Roos wrote:
>>> Could you paste the output of the below?
>>>
>>> $ cat /sys/devices/system/node/node*/distance
>>
>> 10 12 12 14 14 14 14 16
>> 12 10 14 12 14 14 12 14
>> 12 14 10 14 12 12 14 14
>> 14 12 14 10 12 12 14 14
>> 14 14 12 12 10 14 12 14
>> 14 14 12 12 14 10 14 12
>> 14 12 14 14 12 14 10 12
>> 16 14 14 14 14 12 12 10
>>
>
> Thanks!
>
>>
>>> Additionally, booting your system with CONFIG_SCHED_DEBUG=y and
>>> appending 'sched_debug' to your cmdline should yield some extra data.
>>
>> [ 0.000000] Linux version 5.11.0-rc4-00015-g45dfb8a5659a (mroos@x4600m2) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.1) #55 SMP Thu Jan 21 19:23:10 EET 2021
>> [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.11.0-rc4-00015-g45dfb8a5659a root=/dev/sda1 ro quiet
>
> This is missing 'sched_debug' to get the extra topology debug prints (yes
> it needs an extra cmdline argument on top of having CONFIG_SCHED_DEBUG=y),
> but I should be able to generate those locally by feeding QEMU the above
> distance table.

This can be reproduced with QEMU (simplified to only 1 CPU per node):

$ qemu-system-aarch64 -kernel /opt/git/kernel_org/arch/arm64/boot/Image -hda /opt/git/tools/qemu-imgs-manipulator/images/qemu-image-aarch64.img -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -nographic -machine virt,gic-version=max -smp cores=8 -m 512 -cpu cortex-a57 -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, -numa node,cpus=4,nodeid=4, -numa node,cpus=5,nodeid=5, -numa node,cpus=6,nodeid=6, -numa node,cpus=7,nodeid=7, -numa dist,src=0,dst=1,val=12, -numa dist,src=0,dst=2,val=12, -numa dist,src=0,dst=3,val=14, -numa dist,src=0,dst=4,val=14, -numa dist,src=0,dst=5,val=14, -numa dist,src=0,dst=6,val=14, -numa dist,src=0,dst=7,val=16, -numa dist,src=1,dst=2,val=14, -numa dist,src=1,dst=3,val=12, -numa dist,src=1,dst=4,val=14, -numa dist,src=1,dst=5,val=14, -numa dist,src=1,dst=6,val=12, -numa dist,src=1,dst=7,val=14, -numa dist,src=2,dst=3,val=14, -numa dist,src=2,dst=4,val=12, -numa dist,src=2,dst=5,val=12, -numa dist,src=2,dst=6,val=14, -numa dist,src=2,dst=7,val=14, -numa dist,src=3,dst=4,val=12, -numa dist,src=3,dst=5,val=12, -numa dist,src=3,dst=6,val=14, -numa dist,src=3,dst=7,val=14, -numa dist,src=4,dst=5,val=14, -numa dist,src=4,dst=6,val=12, -numa dist,src=4,dst=7,val=14, -numa dist,src=5,dst=6,val=14, -numa dist,src=5,dst=7,val=12, -numa dist,src=6,dst=7,val=12
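
In case someone wants to try a different distance table, the -numa part of the command line can be generated instead of typed by hand. The snippet below is only a sketch: numa_args() and DISTANCES are names made up for illustration, and it assumes one CPU per node and a symmetric table, like the command above.

```python
# Distance table as reported by /sys/devices/system/node/node*/distance.
DISTANCES = [
    [10, 12, 12, 14, 14, 14, 14, 16],
    [12, 10, 14, 12, 14, 14, 12, 14],
    [12, 14, 10, 14, 12, 12, 14, 14],
    [14, 12, 14, 10, 12, 12, 14, 14],
    [14, 14, 12, 12, 10, 14, 12, 14],
    [14, 14, 12, 12, 14, 10, 14, 12],
    [14, 12, 14, 14, 12, 14, 10, 12],
    [16, 14, 14, 14, 14, 12, 12, 10],
]


def numa_args(dist):
    """Return QEMU -numa arguments for a square distance table.

    Hypothetical helper: one CPU per node, one 'dist' entry per node pair.
    """
    n = len(dist)
    args = []
    for node in range(n):
        args += ["-numa", f"node,cpus={node},nodeid={node}"]
    for src in range(n):
        for dst in range(src + 1, n):
            # Only the upper triangle is emitted, so the table must be symmetric.
            assert dist[src][dst] == dist[dst][src], "asymmetric distance table"
            args += ["-numa", f"dist,src={src},dst={dst},val={dist[src][dst]}"]
    return args


if __name__ == "__main__":
    # Paste the result after the '-cpu ...' option of the QEMU command.
    print(" ".join(numa_args(DISTANCES)))
```

The '-smp cores=N' value has to match the number of rows in the table.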

[ 0.206628] ------------[ cut here ]------------
[ 0.206698] Shortest NUMA path spans too many nodes
[ 0.207119] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:753 cpu_attach_domain+0x42c/0x87c
[ 0.207176] Modules linked in:
[ 0.207373] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.11.0-rc2-00010-g65bcf072e20e-dirty #81
[ 0.207458] Hardware name: linux,dummy-virt (DT)
[ 0.207584] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 0.207618] pc : cpu_attach_domain+0x42c/0x87c
[ 0.207646] lr : cpu_attach_domain+0x42c/0x87c
[ 0.207665] sp : ffff800011fcbbf0
[ 0.207679] x29: ffff800011fcbbf0 x28: ffff0000024d8200
[ 0.207735] x27: 0000000000001fef x26: 0000000000001917
[ 0.207755] x25: ffff0000024d8000 x24: 0000000000001917
[ 0.207772] x23: 0000000000000000 x22: ffff800011b69a40
[ 0.207789] x21: ffff0000024d8320 x20: ffff8000116fda80
[ 0.207806] x19: ffff0000024d8000 x18: 0000000000000000
[ 0.207822] x17: 0000000000000000 x16: 00000000bd30d762
[ 0.207838] x15: 0000000000000030 x14: ffffffffffffffff
[ 0.207855] x13: ffff800011b82e08 x12: 00000000000001b9
[ 0.207871] x11: 0000000000000093 x10: ffff800011bdae08
[ 0.207887] x9 : 00000000fffff000 x8 : ffff800011b82e08
[ 0.207922] x7 : ffff800011bdae08 x6 : 0000000000000000
[ 0.207939] x5 : 0000000000000000 x4 : 0000000000000000
[ 0.207955] x3 : 00000000ffffffff x2 : 0000000000000000
[ 0.207972] x1 : 0000000000000000 x0 : ffff000018020000
[ 0.208125] Call trace:
[ 0.208230] cpu_attach_domain+0x42c/0x87c
[ 0.208256] build_sched_domains+0x1238/0x12f4
[ 0.208271] sched_init_domains+0x80/0xb0
[ 0.208283] sched_init_smp+0x30/0x80
[ 0.208299] kernel_init_freeable+0xf4/0x238
[ 0.208313] kernel_init+0x14/0x118
[ 0.208328] ret_from_fork+0x10/0x34
[ 0.208507] ---[ end trace 75cafa7c7d1a3d7e ]---
[ 0.208706] CPU0 attaching sched-domain(s):
[ 0.208756] domain-0: span=0-2 level=NUMA
[ 0.209001] groups: 0:{ span=0 cap=1017 }, 1:{ span=1 cap=1016 }, 2:{ span=2 cap=1015 }
[ 0.209247] domain-1: span=0-6 level=NUMA
[ 0.209280] groups: 0:{ span=0-2 mask=0 cap=3048 }, 3:{ span=1,3-5 mask=3 cap=4073 }, 6:{ span=1,4,6-7 mask=6 cap=4084 }
[ 0.209693] ERROR: groups don't span domain->span
[ 0.209703] domain-2: span=0-7 level=NUMA
[ 0.209722] groups: 0:{ span=0-6 mask=0 cap=7114 }, 7:{ span=1-7 mask=7 cap=7163 }
[ 0.210361] CPU1 attaching sched-domain(s):
[ 0.210376] domain-0: span=0-1,3,6 level=NUMA
[ 0.210411] groups: 1:{ span=1 cap=1016 }, 3:{ span=3 cap=1018 }, 6:{ span=6 cap=1017 }, 0:{ span=0 cap=1017 }
[ 0.210493] domain-1: span=0-7 level=NUMA
[ 0.210511] groups: 1:{ span=0-1,3,6 mask=1 cap=4075 }, 2:{ span=0,2,4-5 mask=2 cap=4070 }, 7:{ span=5-7 mask=7 cap=3067 }
[ 0.210641] CPU2 attaching sched-domain(s):
[ 0.210653] domain-0: span=0,2,4-5 level=NUMA
[ 0.210672] groups: 2:{ span=2 cap=1015 }, 4:{ span=4 cap=1016 }, 5:{ span=5 cap=1015 }, 0:{ span=0 cap=1017 }
[ 0.210752] domain-1: span=0-7 level=NUMA
[ 0.210769] groups: 2:{ span=0,2,4-5 mask=2 cap=4070 }, 3:{ span=1,3-5 mask=3 cap=4073 }, 6:{ span=1,4,6-7 mask=6 cap=4084 }
[ 0.210860] CPU3 attaching sched-domain(s):
[ 0.210870] domain-0: span=1,3-5 level=NUMA
[ 0.210887] groups: 3:{ span=3 cap=1018 }, 4:{ span=4 cap=1016 }, 5:{ span=5 cap=1015 }, 1:{ span=1 cap=1016 }
[ 0.210965] domain-1: span=0-7 level=NUMA
[ 0.210981] groups: 3:{ span=1,3-5 mask=3 cap=4073 }, 6:{ span=1,4,6-7 mask=6 cap=4084 }, 0:{ span=0-2 mask=0 cap=3048 }
[ 0.211109] CPU4 attaching sched-domain(s):
[ 0.211134] domain-0: span=2-4,6 level=NUMA
[ 0.211151] groups: 4:{ span=4 cap=1016 }, 6:{ span=6 cap=1017 }, 2:{ span=2 cap=1015 }, 3:{ span=3 cap=1018 }
[ 0.211229] domain-1: span=0-7 level=NUMA
[ 0.211245] groups: 4:{ span=2-4,6 mask=4 cap=4081 }, 5:{ span=2-3,5,7 mask=5 cap=4082 }, 0:{ span=0-2 mask=0 cap=3048 }
[ 0.211383] CPU5 attaching sched-domain(s):
[ 0.211393] domain-0: span=2-3,5,7 level=NUMA
[ 0.211425] groups: 5:{ span=5 cap=1015 }, 7:{ span=7 cap=1019 }, 2:{ span=2 cap=1015 }, 3:{ span=3 cap=1018 }
[ 0.211506] domain-1: span=0-7 level=NUMA
[ 0.211524] groups: 5:{ span=2-3,5,7 mask=5 cap=4082 }, 6:{ span=1,4,6-7 mask=6 cap=4084 }, 0:{ span=0-2 mask=0 cap=3048 }
[ 0.211618] CPU6 attaching sched-domain(s):
[ 0.211628] domain-0: span=1,4,6-7 level=NUMA
[ 0.211645] groups: 6:{ span=6 cap=1017 }, 7:{ span=7 cap=1019 }, 1:{ span=1 cap=1016 }, 4:{ span=4 cap=1016 }
[ 0.211728] domain-1: span=0-7 level=NUMA
[ 0.211745] groups: 6:{ span=1,4,6-7 mask=6 cap=4084 }, 0:{ span=0-2 mask=0 cap=3048 }, 3:{ span=1,3-5 mask=3 cap=4073 }
[ 0.211855] CPU7 attaching sched-domain(s):
[ 0.211866] domain-0: span=5-7 level=NUMA
[ 0.211884] groups: 7:{ span=7 cap=1019 }, 5:{ span=5 cap=1015 }, 6:{ span=6 cap=1017 }
[ 0.211949] domain-1: span=1-7 level=NUMA
[ 0.211966] groups: 7:{ span=5-7 mask=7 cap=3067 }, 1:{ span=0-1,3,6 mask=1 cap=4075 }, 2:{ span=0,2,4-5 mask=2 cap=4070 }
[ 0.212047] ERROR: groups don't span domain->span
[ 0.212055] domain-2: span=0-7 level=NUMA
[ 0.212072] groups: 7:{ span=1-7 mask=7 cap=7163 }, 0:{ span=0-6 mask=0 cap=7114 }
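
The two "ERROR: groups don't span domain->span" lines can be cross-checked by hand from the log. As far as I can tell, the debug code complains when the union of the group spans differs from the domain span. A rough sketch of that comparison follows; parse_cpulist() and groups_span_domain() are made-up helper names, not kernel functions.

```python
def parse_cpulist(text):
    """Parse a kernel-style CPU list such as '1,3-5' into a set of ints."""
    cpus = set()
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def groups_span_domain(domain_span, group_spans):
    """Return True if the union of the group spans equals the domain span."""
    union = set()
    for span in group_spans:
        union |= parse_cpulist(span)
    return union == parse_cpulist(domain_span)


if __name__ == "__main__":
    # CPU0 domain-1: the groups reach CPU7, which is outside span=0-6.
    print(groups_span_domain("0-6", ["0-2", "1,3-5", "1,4,6-7"]))
    # CPU7 domain-1: the groups reach CPU0, which is outside span=1-7.
    print(groups_span_domain("1-7", ["5-7", "0-1,3,6", "0,2,4-5"]))
    # CPU1 domain-1: the groups cover exactly span=0-7.
    print(groups_span_domain("0-7", ["0-1,3,6", "0,2,4-5", "5-7"]))
```

Only the domain-1 levels of CPU0 and CPU7 fail this check, and those are the two nodes that are at distance 16 from each other.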

# cat /sys/devices/system/node/node*/distance
10 12 12 14 14 14 14 16
12 10 14 12 14 14 12 14
12 14 10 14 12 12 14 14
14 12 14 10 12 12 14 14
14 14 12 12 10 14 12 14
14 14 12 12 14 10 14 12
14 12 14 14 12 14 10 12
16 14 14 14 14 12 12 10

The '16' seems to be the culprit. What does such a topology look like?
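
One way to get a feel for it is to treat every distance-12 pair as a direct link and count hops with a breadth-first search. This is only a sketch under that assumption (the table itself does not say which nodes are physically wired together), and LINK, neighbours() and hops_from() are names picked for illustration.

```python
from collections import deque

DISTANCES = [
    [10, 12, 12, 14, 14, 14, 14, 16],
    [12, 10, 14, 12, 14, 14, 12, 14],
    [12, 14, 10, 14, 12, 12, 14, 14],
    [14, 12, 14, 10, 12, 12, 14, 14],
    [14, 14, 12, 12, 10, 14, 12, 14],
    [14, 14, 12, 12, 14, 10, 14, 12],
    [14, 12, 14, 14, 12, 14, 10, 12],
    [16, 14, 14, 14, 14, 12, 12, 10],
]

LINK = 12  # assumption: the smallest remote distance marks a direct link


def neighbours(dist, node):
    """Nodes assumed to be directly linked to 'node'."""
    return [other for other, d in enumerate(dist[node]) if d == LINK]


def hops_from(dist, start):
    """Breadth-first search: hop count from 'start' to every reachable node."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours(dist, node):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


if __name__ == "__main__":
    for node in range(len(DISTANCES)):
        hops = hops_from(DISTANCES, node)
        print(f"node {node}: links={neighbours(DISTANCES, node)} "
              f"max hops={max(hops.values())}")
```

Under that assumption nodes 0 and 7 have two links each and all other nodes have three, and 0 <-> 7 is the only pair that is 3 hops apart, which is where the '16' sits in the table.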