Re: [PATCH RESEND 1/1] x86/smpboot: check cpu_initialized_mask first after returning from schedule()

From: Dongli Zhang
Date: Mon Jan 10 2022 - 12:44:43 EST


May I have feedback on this patch? It may mitigate a CPU hotplug failure that
is otherwise recoverable only by rebooting the OS, unless the patch below is
applied.

https://lore.kernel.org/all/20211206152034.2150770-1-bigeasy@xxxxxxxxxxxxx/

I see there is also a patch set that may rework this code. It does not change
the logic here.

https://lore.kernel.org/all/20211215145633.5238-1-dwmw2@xxxxxxxxxxxxx/

Thank you very much!

Dongli Zhang

On 12/23/21 1:03 PM, Dongli Zhang wrote:
> To online a new CPU, the master CPU signals the secondary and waits for at
> most 10 seconds until cpu_initialized_mask is set by the secondary CPU.
> The master CPU can fail the online operation when schedule() takes 10+
> seconds to return, even though cpu_initialized_mask has already been set
> by the secondary CPU.
>
> 1. The master CPU signals the secondary CPU in do_boot_cpu().
>
> 2. As the cpu_initialized_mask is still not set, the master CPU reschedules
> via schedule().
>
> 3. Suppose it takes 10+ seconds until schedule() returns due to a
> performance issue. The secondary CPU sets the cpu_initialized_mask during
> those 10+ seconds.
>
> 4. Once schedule() returns on the master CPU, time_before(jiffies, timeout)
> is false even though cpu_initialized_mask is set. As a result, the master
> CPU regards this as a failure.
>
> [ 51.983296] smpboot: do_boot_cpu failed(-1) to wakeup CPU#4
>
> 5. Since the secondary CPU is stuck in state CPU_UP_PREPARE, any further
> online operation will fail in cpu_check_up_prepare(), unless the patch set
> below is applied.
>
> https://lore.kernel.org/all/20211206152034.2150770-1-bigeasy@xxxxxxxxxxxxx/
>
> This issue is resolved by always first checking whether the secondary CPU
> has set cpu_initialized_mask.
>
> Cc: Longpeng(Mike) <longpeng2@xxxxxxxxxx>
> Cc: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> Cc: Joe Jin <joe.jin@xxxxxxxxxx>
> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
> ---
> Resending with linux-kernel@xxxxxxxxxxxxxxx in Cc as well.
>
> arch/x86/kernel/smpboot.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 617012f4619f..faad4fcf67eb 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1145,7 +1145,7 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
> */
> boot_error = -1;
> timeout = jiffies + 10*HZ;
> - while (time_before(jiffies, timeout)) {
> + while (true) {
> if (cpumask_test_cpu(cpu, cpu_initialized_mask)) {
> /*
> * Tell AP to proceed with initialization
> @@ -1154,6 +1154,10 @@ static int do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
> boot_error = 0;
> break;
> }
> +
> + if (time_after_eq(jiffies, timeout))
> + break;
> +
> schedule();
> }
> }
>
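
For illustration only, the ordering problem in steps 2-4 above can be
reproduced in a small user-space sketch. Nothing below is kernel code:
struct sim, sim_schedule(), wait_old(), wait_new() and the tick values are
made-up stand-ins, and the sim_time_*() helpers only mimic the
wraparound-safe comparisons done by time_before()/time_after_eq(). It
assumes each simulated schedule() consumes a fixed, nonzero number of ticks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simulation state; none of these are kernel symbols. */
struct sim {
	unsigned long jiffies;     /* simulated clock */
	unsigned long ap_ready_at; /* tick at which the AP "sets" its bit */
	unsigned long sched_delay; /* ticks consumed by one schedule(), > 0 */
};

/* Wraparound-safe comparisons, modeled on time_before()/time_after_eq(). */
static bool sim_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

static bool sim_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Stand-in for cpumask_test_cpu(cpu, cpu_initialized_mask). */
static bool ap_initialized(const struct sim *s)
{
	return sim_time_after_eq(s->jiffies, s->ap_ready_at);
}

/* Stand-in for schedule(): time passes while the caller is off-CPU. */
static void sim_schedule(struct sim *s)
{
	s->jiffies += s->sched_delay;
}

/* Original ordering: the timeout is tested before the flag. */
static int wait_old(struct sim s, unsigned long timeout)
{
	while (sim_time_before(s.jiffies, timeout)) {
		if (ap_initialized(&s))
			return 0;
		sim_schedule(&s);
	}
	return -1;
}

/* Patched ordering: the flag is tested first, then the timeout. */
static int wait_new(struct sim s, unsigned long timeout)
{
	for (;;) {
		if (ap_initialized(&s))
			return 0;
		if (sim_time_after_eq(s.jiffies, timeout))
			return -1;
		sim_schedule(&s);
	}
}

/* AP is ready at tick 5, but one schedule() takes 12 ticks (timeout 10). */
static void demo_late_schedule(void)
{
	struct sim s = { .jiffies = 0, .ap_ready_at = 5, .sched_delay = 12 };
	unsigned long timeout = s.jiffies + 10;

	assert(wait_old(s, timeout) == -1); /* flag set, still reported failed */
	assert(wait_new(s, timeout) == 0);  /* flag seen before timeout check */
}
```

In this model both loops still report failure when the AP never sets its
bit, so the timeout itself is preserved. Only the case where the bit is
already set by the time schedule() returns changes outcome.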