Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node
From: Laurent Vivier
Date: Fri Mar 15 2019 - 09:05:10 EST
On 15/03/2019 13:25, Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 12:12:45PM +0100, Laurent Vivier wrote:
>
>> Another way to keep the offline nodes from overlapping at startup is
>> to ensure the default values don't define a distance that merges all
>> offline nodes into node 0.
>>
>> A powerpc-specific patch can work around the kernel crash by doing this:
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 87f0dd0..3ba29bb 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
>>  	struct device_node *memory;
>>  	int default_nid = 0;
>>  	unsigned long i;
>> +	int nid, dist;
>>
>>  	if (numa_enabled == 0) {
>>  		printk(KERN_WARNING "NUMA disabled by user\n");
>> @@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
>>
>>  	dbg("NUMA associativity depth for CPU/Memory: %d\n",
>>  	    min_common_depth);
>>
>> +	for (nid = 0; nid < MAX_NUMNODES; nid++)
>> +		for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
>> +			distance_lookup_table[nid][dist] = nid;
>> +
>>  	/*
>>  	 * Even though we connect cpus to numa domains later in SMP
>>  	 * init, we need to know the node ids now. This is because
>
> What does that actually do? That is, what does it make the distance
> table look like before and after you bring up the CPUs?

By default the table is filled with zeros. When a CPU is brought up, the
value is read from the device tree and the table is updated. From what
I've seen, two nodes have the same value at a given level if they share
that level.

So, because the table is initialized with 0, all offline nodes (no
memory, no CPU) are merged with node 0.

My fix initializes the table with a unique value for each node, so by
default no nodes are merged.
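
As a rough illustration, here is a userspace sketch (not the kernel
code) of why a zero-filled table makes every node look local to node 0.
It assumes the distance is derived by comparing the table entries of two
nodes level by level and doubling the distance at each level where they
differ, which is how I read the powerpc lookup. NR_NODES, NR_LEVELS,
lookup[], init_zero(), init_unique() and node_distance() are
hypothetical names and sizes chosen for the example.

```c
#include <assert.h>

/* Illustrative sizes only, not MAX_NUMNODES / MAX_DISTANCE_REF_POINTS. */
#define NR_NODES	4
#define NR_LEVELS	2
#define LOCAL_DISTANCE	10

/* Stand-in for distance_lookup_table[][]; statics start out as all zeros. */
static int lookup[NR_NODES][NR_LEVELS];

/* Default state: every entry is 0, so every node matches node 0. */
static void init_zero(void)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		for (int lvl = 0; lvl < NR_LEVELS; lvl++)
			lookup[nid][lvl] = 0;
}

/* What the proposed loop does: each node gets its own id at every level. */
static void init_unique(void)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		for (int lvl = 0; lvl < NR_LEVELS; lvl++)
			lookup[nid][lvl] = nid;
}

/*
 * Assumed model: stop at the first level where both nodes share a value,
 * and double the distance for each level where they differ.
 */
static int node_distance(int a, int b)
{
	int distance = LOCAL_DISTANCE;

	assert(a >= 0 && a < NR_NODES);
	assert(b >= 0 && b < NR_NODES);

	for (int lvl = 0; lvl < NR_LEVELS; lvl++) {
		if (lookup[a][lvl] == lookup[b][lvl])
			break;
		distance *= 2;
	}
	return distance;
}
```

In this model, after init_zero() node 3 sits at LOCAL_DISTANCE from
node 0, so it is indistinguishable from node 0. After init_unique() the
two nodes differ at every level and end up at the maximum distance until
real values are read from the device tree.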
>
>> Any comment?
>
> Well, I had a few questions here:
>
> 20190305115952.GH32477@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> that I've not yet seen answers to.

I didn't answer because:

- I thought this was not the right way to fix the problem, since you
  said "it seems very fragile and unfortunate",
- I don't have the answers. I'd really like someone from IBM who knows
  the NUMA part of powerpc well to answer these questions... and perhaps
  find a better solution.

Thanks,
Laurent