Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node
From: Peter Zijlstra
Date: Mon Mar 18 2019 - 07:27:17 EST
On Mon, Mar 18, 2019 at 04:17:30PM +0530, Srikar Dronamraju wrote:
> > > node 0 (because firmware doesn't provide the distance information for
> > > memoryless/cpuless nodes):
> > >
> > > node   0   1   2   3
> > >   0:  10  40  10  10
> > >   1:  40  10  40  40
> > >   2:  10  40  10  10
> > >   3:  10  40  10  10
> >
> > *groan*... what does it do for things like percpu memory? ISTR the
> > per-cpu chunks are all allocated early too. Having them all use memory
> > out of node-0 would seem sub-optimal.
>
> In the specific failing case, there is only one node with memory; all
> other nodes are CPU-only nodes.
>
> However, in the generic case, since it is just a CPU hotplug operation,
> the memory allocated early for the per-cpu chunks would remain.
What do you do in the case where there are multiple nodes with memory,
but only one with CPUs on?
Do you then still allocate the per-cpu memory on node0 for the CPUs
that will appear on that second node?
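To make the concern concrete, here is a toy userspace sketch (not
kernel code; fw_node[] and early_node_of() are made up for
illustration) of per-cpu placement falling back to node 0 when the
firmware hasn't told us the node of a not-yet-present CPU:

/*
 * Toy model of the concern above: per-cpu chunks are placed at boot
 * using whatever node the CPU is believed to be on at that point.
 * If firmware doesn't report the node of a not-yet-present CPU, the
 * hypothetical early_node_of() below falls back to node 0, so the
 * chunk lands on node 0 even if the CPU later comes up elsewhere.
 */
#include <stdio.h>

#define NR_CPUS		8
#define NUMA_NO_NODE	(-1)

/* node per CPU as reported by firmware at boot (made-up data) */
static int fw_node[NR_CPUS] = {
	0, 0, NUMA_NO_NODE, NUMA_NO_NODE,
	1, 1, NUMA_NO_NODE, NUMA_NO_NODE,
};

/* hypothetical helper: node used when the per-cpu area is carved out */
static int early_node_of(int cpu)
{
	return fw_node[cpu] == NUMA_NO_NODE ? 0 : fw_node[cpu];
}

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu%d: per-cpu chunk allocated on node %d\n",
		       cpu, early_node_of(cpu));
	return 0;
}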
> > > We should have:
> > >
> > > node   0   1   2   3
> > >   0:  10  40  40  40
> > >   1:  40  10  40  40
> > >   2:  40  40  10  40
> > >   3:  40  40  40  10
> >
> > Can it happen that it introduces a new distance in the table? One that
> > hasn't been seen before? This example only has 10 and 40, but suppose
> > the new node lands at distance 20 (or 80); can such a thing happen?
> >
> > If not; why not?
>
> Yes, distances can be 20, 40 or 80. There is nothing that forces the
> node distance to always be 40.
This,
> > So you're relying on sched_domain_numa_masks_set/clear() to fix this up,
> > but that in turn relies on the sched_domain_numa_levels thing to stay
> > accurate.
> >
> > This all seems very fragile and unfortunate.
> >
>
> Any reasons why this is fragile?
breaks that patch. The code assumes all the numa distances are known at
boot. If you add distances later, it comes unstuck.
It's not like you're actually changing the interconnects around at
runtime. Node topology really should be known at boot time.
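Roughly the shape of the problem, as a toy model (not the actual
sched_domain code; record_distance() and distance_to_level() are
made-up helpers): the set of distinct distances is fixed at boot, so a
distance that first shows up at hotplug time has nothing to map to:

/*
 * Toy model of the boot-time assumption: the distinct NUMA distances
 * seen at boot become the fixed set of "levels".  A distance that
 * first appears later has no level to hang masks off.
 */
#include <stdio.h>

#define MAX_LEVELS	8

static int levels[MAX_LEVELS];
static int nr_levels;

/* collect a distance into the boot-time level table */
static void record_distance(int d)
{
	for (int i = 0; i < nr_levels; i++)
		if (levels[i] == d)
			return;
	levels[nr_levels++] = d;
}

/* map a distance to its level; -1 if it was never seen at boot */
static int distance_to_level(int d)
{
	for (int i = 0; i < nr_levels; i++)
		if (levels[i] == d)
			return i;
	return -1;
}

int main(void)
{
	/* distances visible at boot: only 10 and 40 */
	record_distance(10);
	record_distance(40);

	/* a node hotplugged later at distance 20 has no level */
	printf("level for 40: %d\n", distance_to_level(40));
	printf("level for 20: %d\n", distance_to_level(20));
	return 0;
}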
What I _think_ the x86 BIOS does is, for each empty socket, enumerate
as many (non-present) logical CPUs as it finds on Socket-0 (or whatever
socket is the boot socket).
Those non-present CPUs are assigned to their respective nodes, and
if/when a physical CPU is placed in the socket and its CPUs are
onlined, it all 'works' (see ACPI SRAT).
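In other words, something like this toy sketch (layout and names
invented for illustration): every possible CPU already has a node
recorded at boot, so onlining one later never introduces a new node or
distance:

/*
 * Toy sketch of the SRAT-style approach described above: possible but
 * not-present CPUs are enumerated at boot with their node already
 * known, so hotplug only flips "present", never the topology.
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_POSSIBLE_CPUS	8

struct possible_cpu {
	bool present;	/* physically populated at boot? */
	int  node;	/* node from the boot-time static table */
};

static struct possible_cpu cpus[NR_POSSIBLE_CPUS] = {
	{ true,  0 }, { true,  0 }, { true,  0 }, { true,  0 },
	/* empty socket: CPUs enumerated anyway, node known up front */
	{ false, 1 }, { false, 1 }, { false, 1 }, { false, 1 },
};

int main(void)
{
	for (int cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++)
		printf("cpu%d: %s, node %d known at boot\n", cpu,
		       cpus[cpu].present ? "present" : "not present",
		       cpus[cpu].node);
	return 0;
}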
I'm not entirely sure what happens on x86 when it boots with, say, a
10-core part and you then fill an empty socket with a 20-core part. I
suspect we simply will not use more than 10 of them; we'll not have
space reserved in the Linux cpumasks for them anyway.