Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

From: Laurent Vivier
Date: Fri Mar 15 2019 - 07:12:52 EST


On 04/03/2019 20:59, Laurent Vivier wrote:
> When we hotplug a CPU in a memoryless/cpuless node,
> the kernel crashes when it rebuilds the sched_domains data.
>
> I can reproduce this problem on POWER with a pseries VM, using the following
> QEMU parameters:
>
> -machine pseries -enable-kvm -m 8192 \
> -smp 2,maxcpus=8,sockets=4,cores=2,threads=1 \
> -numa node,nodeid=0,cpus=0-1,mem=0 \
> -numa node,nodeid=1,cpus=2-3,mem=8192 \
> -numa node,nodeid=2,cpus=4-5,mem=0 \
> -numa node,nodeid=3,cpus=6-7,mem=0
>
> Then I can trigger the crash by hotplugging a CPU on node-id 3:
>
> (qemu) device_add host-spapr-cpu-core,core-id=7,node-id=3
>
> Built 2 zonelists, mobility grouping on. Total pages: 130162
> Policy zone: Normal
> WARNING: workqueue cpumask: online intersect > possible intersect
> BUG: Kernel NULL pointer dereference at 0x00000400
> Faulting instruction address: 0xc000000000170edc
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter xts vmx_crypto ip_tables xfs libcrc32c virtio_net net_failover failover virtio_blk virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
> CPU: 2 PID: 5661 Comm: kworker/2:0 Not tainted 5.0.0-rc6+ #20
> Workqueue: events cpuset_hotplug_workfn
> NIP: c000000000170edc LR: c000000000170f98 CTR: 0000000000000000
> REGS: c000000003e931a0 TRAP: 0380 Not tainted (5.0.0-rc6+)
> MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22284028 XER: 00000000
> CFAR: c000000000170f20 IRQMASK: 0
> GPR00: c000000000170f98 c000000003e93430 c0000000011ac500 c0000001efe22000
> GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000010
> GPR08: 0000000000000001 0000000000000400 ffffffffffffffff 0000000000000000
> GPR12: 0000000000008800 c00000003fffd680 c0000001f14b0000 c0000000011e1bf0
> GPR16: c0000000011e61f4 c0000001efe22200 c0000001efe22020 c0000001fba80000
> GPR20: c0000001ff567a80 0000000000000001 c000000000e27a80 ffffffffffffe830
> GPR24: ffffffffffffec30 000000000000102f 000000000000102f c0000001efca1000
> GPR28: c0000001efca0400 c0000001efe22000 c0000001efe23bff c0000001efe22a00
> NIP [c000000000170edc] free_sched_groups+0x5c/0xf0
> LR [c000000000170f98] destroy_sched_domain+0x28/0x90
> Call Trace:
> [c000000003e93430] [000000000000102f] 0x102f (unreliable)
> [c000000003e93470] [c000000000170f98] destroy_sched_domain+0x28/0x90
> [c000000003e934a0] [c0000000001716e0] cpu_attach_domain+0x100/0x920
> [c000000003e93600] [c000000000173128] build_sched_domains+0x1228/0x1370
> [c000000003e93740] [c00000000017429c] partition_sched_domains+0x23c/0x400
> [c000000003e937e0] [c0000000001f5ec8] rebuild_sched_domains_locked+0x78/0xe0
> [c000000003e93820] [c0000000001f9ff0] rebuild_sched_domains+0x30/0x50
> [c000000003e93850] [c0000000001fa1c0] cpuset_hotplug_workfn+0x1b0/0xb70
> [c000000003e93c80] [c00000000012e5a0] process_one_work+0x1b0/0x480
> [c000000003e93d20] [c00000000012e8f8] worker_thread+0x88/0x540
> [c000000003e93db0] [c00000000013714c] kthread+0x15c/0x1a0
> [c000000003e93e20] [c00000000000b55c] ret_from_kernel_thread+0x5c/0x80
> Instruction dump:
> 2e240000 f8010010 f821ffc1 409e0014 48000080 7fbdf040 7fdff378 419e0074
> ebdf0000 4192002c e93f0010 7c0004ac <7d404828> 314affff 7d40492d 40c2fff4
> ---[ end trace f992c4a7d47d602a ]---
>
> Kernel panic - not syncing: Fatal exception
>
> This happens in free_sched_groups() because the linked list of the
> sched_groups is corrupted. Here is what happens when we hotplug the CPU:
>
> - build_sched_groups() builds a sched_groups linked list for
> sched_domain D1, with only one entry A, refcount=1
>
> D1: A(ref=1)
>
> - build_sched_groups() builds a sched_groups linked list for
> sched_domain D2, with the same entry A
>
> D2: A(ref=2)
>
> - build_sched_groups() builds a sched_groups linked list for
> sched_domain D3, with the same entry A and a new entry B:
>
> D3: A(ref=3) -> B(ref=1)
>
> - destroy_sched_domain() is called for D1:
>
> D1: A(ref=3) -> B(ref=1), and as B's ref is 1, the memory of B is
> released, but A->next still points to B
>
> - destroy_sched_domain() is called for D3:
>
> D3: A(ref=2) -> B(ref=0)
>
> The kernel crashes when it tries to use data inside B: the memory has
> already been freed, so its contents are garbage and the linked list
> (next) is broken too.
>
> This problem appears with commit 051f3ca02e46
> ("sched/topology: Introduce NUMA identity node sched domain").
>
> If I compare the sequence of function calls before and after this commit,
> I can see that in the working case build_overlap_sched_groups() is called
> instead of build_sched_groups(); in that case the reference counters all
> have the same value and the linked list can be correctly freed.
> The commit introduced the node domain, and on powerpc the node domain
> can overlap, whereas it should not.
>
> This happens because initially the powerpc code computes the
> sched_domains_numa_masks of offline nodes as if they were merged with
> node 0 (because the firmware doesn't provide distance information for
> memoryless/cpuless nodes):
>
> node 0 1 2 3
> 0: 10 40 10 10
> 1: 40 10 40 40
> 2: 10 40 10 10
> 3: 10 40 10 10
>
> We should have:
>
> node 0 1 2 3
> 0: 10 40 40 40
> 1: 40 10 40 40
> 2: 40 40 10 40
> 3: 40 40 40 10
>
> And once a new CPU is added and the node is onlined, the NUMA masks are
> updated but the initially set bits are not cleared. This explains why
> nodes can overlap.
>
> This patch changes the initialization code to not set the distance for
> offline nodes. The distances will be updated when the node becomes
> online (on CPU hotplug), as is already done.
>
> This patch has been tested on powerpc but not on the other architectures.
> They are affected because the modified part is in common code.
> All comments are welcome (how to move the change to powerpc-specific code,
> or whether the other architectures can work with this change).
>
> Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@xxxxxxx>
> Cc: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxx>
> Cc: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx>
> Cc: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
> Cc: Nathan Fontenot <nfont@xxxxxxxxxxxxxxxxxx>
> Cc: Michael Bringmann <mwb@xxxxxxxxxxxxxxxxxx>
> Cc: linuxppc-dev@xxxxxxxxxxxxxxxx
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Laurent Vivier <lvivier@xxxxxxxxxx>
> ---
>
> Notes:
> v3: fix the root cause of the problem (sched numa mask initialization)
> v2: add scheduler maintainers in the CC: list
>
> kernel/sched/topology.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 3f35ba1d8fde..24831b86533b 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1622,8 +1622,10 @@ void sched_init_numa(void)
> return;
>
> sched_domains_numa_masks[i][j] = mask;
> + if (!node_state(j, N_ONLINE))
> + continue;
>
> - for_each_node(k) {
> + for_each_online_node(k) {
> if (node_distance(j, k) > sched_domains_numa_distance[i])
> continue;
>
>

Another way to avoid the overlap of the offline nodes at startup is to
ensure the default values don't define a distance that merges all offline
nodes into node 0.

A powerpc-specific patch can work around the kernel crash by doing this:

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 87f0dd0..3ba29bb 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
struct device_node *memory;
int default_nid = 0;
unsigned long i;
+ int nid, dist;

if (numa_enabled == 0) {
printk(KERN_WARNING "NUMA disabled by user\n");
@@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)

dbg("NUMA associativity depth for CPU/Memory: %d\n",
min_common_depth);

+ for (nid = 0; nid < MAX_NUMNODES; nid++)
+ for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
+ distance_lookup_table[nid][dist] = nid;
+
/*
* Even though we connect cpus to numa domains later in SMP
* init, we need to know the node ids now. This is because

Any comment?

If this is not the right way to do it, does someone have a better idea of
how to fix the kernel crash?

Thanks,
Laurent