Re: [Fix PATCH] cpu/hotplug: Fix bug report when add "nosmt" parameter with CONFIG_HOTPLUG_CPU=N

From: Thomas Gleixner
Date: Mon Mar 25 2019 - 10:17:15 EST


Tianyu,

On Mon, 25 Mar 2019, lantianyu1986@xxxxxxxxx wrote:

thanks for tracking this down.

A few nit picks vs. the subject first:

> Subject: [Fix PATCH] cpu/hotplug: Fix bug report when add "nosmt" parameter with CONFIG_HOTPLUG_CPU=N

[PATCH] is sufficient. The extra 'Fix' has no real value. 'Fix bug report'
is weird. Bug reports cannot be fixed, only bugs can be fixed. Also please
avoid overly long subjects. Subjects should be concise and describe the
problem/change as precise as it goes.

> When add "nosmt" parameter, kernel still boots up all logical cpus once
> and set CR4.MCE on each CPU. This is to avoid shutting down machine
> when a broadacasted MCE is observed CR4.MCE=0b. (Detail please see comment
> in the cpu_smt_allowed()). Smt cpus will bring up and bring down during
> kernel boot with "nosmt" parameter.
>
> When CONFIG_HOTPLUG_CPU=Y, CPU_DYING callbacks will be called inside
> stop-machine and irq is disabled. This happens in the take_cpu_down()
> callback. When CONFIG_HOTPLUG_CPU=N,CPU_DYING callbacks will be called
> with irq enabled.

Yes, that's bad.

> smpcfd_dying_cpu() is one of CPU_DYING callbacks and it assumes to be
> called when irq is disabled. smpcfd_dying_cpu() calls flush_smp_call_
> function_queue() which requires to be called with irq disabled.
>
> When CONFIG_HOTPLUG_CPU=N and add "nosmt" parameter, smpcfd_dying_cpu()
> is called with irq enalbed and this triggers BUG_ON(!irqs_disabled())
> in the irq_work_run_list(). This patch is to fix the issue.

That's not a fix, it's papering over the root cause. As you figured out
already, the DYING callbacks should be called with interrupts disabled.

So rather than "fixing" that particular callback we need to look at the
reason why these callbacks are invoked with interrupts enabled and fix
that.

> Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")

That has nothing to do with 'nosmt'. It's a general bug in the rollback
code when HOTPLUG_CPU=n. 'nosmt' is using the rollback mechanism and is
just a reliable way to trigger the problem. This happens in the same way
when the bringup of a CPU fails for any other reason. That bug was there
way before 0cc3cd21657b.

I'll have a look, but I fear that needs some surgery to fix.

Thanks,

tglx