Re: [PATCH 2/2] x86/numa: instance all parsed numa node

From: Thomas Gleixner
Date: Mon Jul 08 2019 - 05:37:00 EST


On Mon, 8 Jul 2019, Pingfan Liu wrote:
> On Mon, Jul 8, 2019 at 3:44 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, 5 Jul 2019, Pingfan Liu wrote:
> >
> > > I hit a bug on an AMD machine, with the kexec -l nr_cpus=4 option. The
> > > nr_cpus option is used to speed up the kdump process, so it is not a rare
> > > case.
> >
> > But fundamentally wrong, really.
> >
> > The rest of the CPUs are in a half baked state and any broadcast event,
> > e.g. MCE or a stray IPI, will result in an undiagnosable crash.
> I would very much appreciate it if you could elaborate on this. I tried to
> figure out your point, but failed.
>
> By "a half baked state", I think you are concerned about the LAPIC state,
> and I expand on this point as follows:

It's not only the APIC state. It's the state of the CPUs in general.

> For IPI: when the capture kernel's BSP is up, the remaining CPUs are still
> looping inside crash_nmi_callback(), so there is no way for these CPUs to
> send a new IPI. Also, we call disable_local_APIC(), which effectively
> prevents the LAPIC from responding to IPIs, except NMI/INIT/SIPI, which
> will not occur in the crash case.

Fair enough for the IPI case.

> For MCE, I am not sure whether it can be broadcast between CPUs or not,
> but as I understand it, it cannot. So is it a problem?

It can and it does.

That's the whole reason why we bring up all CPUs in the 'nosmt' case and
shut the siblings down again after setting CR4.MCE. Actually that's a
'let's hope no MCE hits before that has happened' approach, but that's all
we can do.

If we don't do that then the MCE broadcast can hit a CPU which has some
firmware initialized state. The result can be a full system lockup, triple
fault etc.

So when the MCE hits a CPU which is still in the crashed kernel lala state,
then all hell breaks loose.
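
To make the difference concrete, here is a rough user-space sketch. It is a
toy model, not kernel code: struct toy_cpu, the mce_ready flag and all three
function names are made up for illustration, and "mce_ready" just stands for
"the running kernel has set CR4.MCE on that CPU and can handle the event".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one present logical CPU. All names are hypothetical. */
struct toy_cpu {
	bool mce_ready;	/* running kernel set CR4.MCE on this CPU */
	bool online;	/* CPU is available for normal work */
};

/* Models nr_cpus=N: CPUs beyond nr_wanted are never touched. */
void toy_bringup_limited(struct toy_cpu *cpus, size_t present,
			 size_t nr_wanted)
{
	assert(nr_wanted <= present);
	for (size_t i = 0; i < present; i++) {
		bool wanted = i < nr_wanted;

		cpus[i].mce_ready = wanted;
		cpus[i].online = wanted;
	}
}

/* Models the 'nosmt' style: touch every CPU, then park the unwanted ones. */
void toy_bringup_all_then_park(struct toy_cpu *cpus, size_t present,
			       size_t nr_wanted)
{
	assert(nr_wanted <= present);
	for (size_t i = 0; i < present; i++) {
		cpus[i].mce_ready = true;
		cpus[i].online = i < nr_wanted;
	}
}

/* A broadcast MCE reaches every present CPU, online or not. */
bool toy_broadcast_mce_survivable(const struct toy_cpu *cpus, size_t present)
{
	for (size_t i = 0; i < present; i++)
		if (!cpus[i].mce_ready)
			return false;
	return true;
}
```

With 8 present CPUs and nr_wanted = 4, toy_broadcast_mce_survivable()
returns false after toy_bringup_limited() and true after
toy_bringup_all_then_park(). The model ignores the window before the last
CPU has been touched, which is exactly the 'let's hope' part.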

> From another viewpoint, is there any difference between nr_cpus=1 and
> nr_cpus>1 in the crash case? If a stray IPI is an issue for nr_cpus>1,
> it is one for nr_cpus=1 as well.

Anything less than the actual number of present CPUs is problematic unless
you use the 'let's hope nothing happens' approach. We could add an option
to stop the bringup at the early online state similar to what we do for
'nosmt'.

Thanks,

tglx