Re: [PATCH 22/25] x86/mcheck: Do the init in one place

From: Sebastian Andrzej Siewior
Date: Wed Nov 09 2016 - 12:23:08 EST


On 2016-11-09 18:01:09 [+0100], Borislav Petkov wrote:
> On Wed, Nov 09, 2016 at 05:24:51PM +0100, Sebastian Andrzej Siewior wrote:
> > The behaviour was not changed - it was only reorganized to keep in one
> > spot.
>
> So there's the full CPU init path down cpu_up() -> ... -> identify_cpu()
> where mcheck_cpu_init() is called and then there's also the hotplug
> callbacks in mce_cpu_callback().

correct. And it looks like mcheck_cpu_init() is part of the notifier. It
tried to allocate memory which it only did on boot CPU.

> What you're proposing now is, merge the two.

I wouldn't call it a merge. identify_cpu() is one of the first things a
CPU does on its boot. Slightly later invokes the `old' CPU_STARTING
notifier callbacks. I had nothing what would match CPU_STARTING except
mcheck_cpu_init() so I took it and renamed to mcheck_cpu_starting().
The remaining bring up states in mce_cpu_callback() are just CPU_ONLINE
which is also converted away.

> But then the full path down identify_cpu() could still do
> mheck_cpu_init() regardless where you move it.
>
> IOW, I still don't see why this change is needed.

>From what it looks, is that mce_cpu_down_dying() is the counter part to
mcheck_cpu_starting()/mcheck_cpu_init(). One hint for it is that the
latter disable the timer which is started by the former. So if you
rollback to online after the DYNING state you need to do STARTING in
order to start the timer.

> In more OW, why can't you simply do:
>
> err = cpuhp_setup_state(CPUHP_AP_X86_MCE_STARTING, "x86/mce:starting",
> mce_reenable_cpu, NULL);

A later patch does:
err = cpuhp_setup_state(CPUHP_AP_X86_MCE_STARTING, "x86/mce:starting",
mcheck_cpu_starting, mce_cpu_down_dying);

> and us> e the current notifier callback?
>
> I still don't get the need for this churn.
I tried to clarify it with the timer. You claimed that the MCE feature
itself is not disabled if the CPU goes down. I assumed that this is the
case (somewhere in mce_disable_cpu()) because a CPU which is off should
not response to any MCE errors and then it would be enabled in the
STARTING callback again. This (and timer) was the motivation to get the
MCE stuff in one place and get symmetrical CPU down/up behaviour.

Sebastian