[PATCH v2 0/2] Skip numa distance for offline nodes

From: Srikar Dronamraju
Date: Thu Jul 01 2021 - 00:17:00 EST


Changelog v1->v2:
v1: http://lore.kernel.org/lkml/20210520154427.1041031-1-srikar@xxxxxxxxxxxxxxxxxx/t/#u
- Update the numa masks, whenever 1st CPU is added to cpuless node
- Populate all possible nodes distances in boot in a
powerpc specific function

Geetika reported yet another trace while doing a dlpar CPU add
operation. This was true even on top of a recent commit
6980d13f0dd1 ("powerpc/smp: Set numa node before updating mask") which fixed
a similar trace.

WARNING: CPU: 40 PID: 2954 at kernel/sched/topology.c:2088 build_sched_domains+0x6e8/0x1540
Modules linked in: nft_counter nft_compat rpadlpar_io rpaphp mptcp_diag
xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag
netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables
nfnetlink dm_multipath pseries_rng xts vmx_crypto binfmt_misc ip_tables xfs
libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror
dm_region_hash dm_log dm_mod fuse
CPU: 40 PID: 2954 Comm: kworker/40:0 Not tainted 5.13.0-rc1+ #19
Workqueue: events cpuset_hotplug_workfn
NIP: c0000000001de588 LR: c0000000001de584 CTR: 00000000006cd36c
REGS: c00000002772b250 TRAP: 0700 Not tainted (5.12.0-rc5-master+)
MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28828422 XER: 0000000d
CFAR: c00000000020c2f8 IRQMASK: 0 #012GPR00: c0000000001de584 c00000002772b4f0
c000000001f55400 0000000000000036 #012GPR04: c0000063c6368010 c0000063c63f0a00
0000000000000027 c0000063c6368018 #012GPR08: 0000000000000023 c0000063c636ef48
00000063c4de0000 c0000063bfe9ffe8 #012GPR12: 0000000028828424 c0000063fe68fe80
0000000000000000 0000000000000417 #012GPR16: 0000000000000028 c00000000740dcd8
c00000000205db68 c000000001a3a4a0 #012GPR20: c000000091ed7d20 c000000091ed8520
0000000000000001 0000000000000000 #012GPR24: c0000000113a9600 0000000000000190
0000000000000028 c0000000010e3ac0 #012GPR28: 0000000000000000 c00000000740dd00
c0000000317b5900 0000000000000190
NIP [c0000000001de588] build_sched_domains+0x6e8/0x1540
LR [c0000000001de584] build_sched_domains+0x6e4/0x1540
Call Trace:
[c00000002772b4f0] [c0000000001de584] build_sched_domains+0x6e4/0x1540 (unreliable)
[c00000002772b640] [c0000000001e08dc] partition_sched_domains_locked+0x3ec/0x530
[c00000002772b6e0] [c0000000002a2144] rebuild_sched_domains_locked+0x524/0xbf0
[c00000002772b7e0] [c0000000002a5620] rebuild_sched_domains+0x40/0x70
[c00000002772b810] [c0000000002a58e4] cpuset_hotplug_workfn+0x294/0xe20
[c00000002772bc30] [c000000000187510] process_one_work+0x300/0x670
[c00000002772bd10] [c0000000001878f8] worker_thread+0x78/0x520
[c00000002772bda0] [c0000000001937f0] kthread+0x1a0/0x1b0
[c00000002772be10] [c00000000000d6ec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
7ee5bb78 7f0ac378 7f29cb78 7f68db78 7f46d378 7f84e378 f8610068 3c62ff19
fbe10060 3863e558 4802dd31 60000000 <0fe00000> 3920fff4 f9210080 e86100b0

Detailed analysis of the failing scenario showed that the span in
question belongs to NODE domain and further the cpumasks for some
cpus in NODE overlapped. There are two possible reasons how we ended
up here:

(1) The numa node was offline or blank with no CPUs or memory. Hence
the sched_max_numa_distance could not be set correctly, or the
sched_domains_numa_distance happened to be partially populated.

(2) Depending on a bogus node_distance of an offline node to populate
cpumasks is the issue. On POWER platform the node_distance is
correctly available only for an online node which has some CPU or
memory resource associated with it.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node 0 1 2 3 4 5 6 7
0: 10 20 40 40 40 40 40 40
1: 20 10 40 40 40 40 40 40
2: 40 40 10 20 40 40 40 40
3: 40 40 20 10 40 40 40 40
4: 40 40 40 40 10 20 40 40
5: 40 40 40 40 20 10 40 40
6: 40 40 40 40 40 40 10 20
7: 40 40 40 40 40 40 20 10

However the same system when only two nodes are online at boot, then the
numa topology will look like
node distances:
node 0 1
0: 10 20
1: 20 10

This series tries to fix both these problems.
Note: These problems are now visible, thanks to
Commit ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't
(partially) overlap")

Cc: LKML <linux-kernel@xxxxxxxxxxxxxxx>
Cc: linuxppc-dev@xxxxxxxxxxxxxxxx
Cc: Nathan Lynch <nathanl@xxxxxxxxxxxxx>
Cc: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Valentin Schneider <valentin.schneider@xxxxxxx>
Cc: Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx>
Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxxx>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@xxxxxxx>
Cc: Laurent Dufour <ldufour@xxxxxxxxxxxxx>

Srikar Dronamraju (2):
sched/topology: Skip updating masks for non-online nodes
powerpc/numa: Fill distance_lookup_table for offline nodes

arch/powerpc/mm/numa.c | 70 +++++++++++++++++++++++++++++++++++++++++
kernel/sched/topology.c | 25 +++++++++++++--
2 files changed, 93 insertions(+), 2 deletions(-)


base-commit: 031e3bd8986fffe31e1ddbf5264cccfe30c9abd7
--
2.27.0