Re: [PATCH] arm64/smp: Move rcu_cpu_starting() earlier

From: Will Deacon
Date: Fri Nov 06 2020 - 05:38:07 EST


On Thu, Nov 05, 2020 at 09:15:24PM -0500, Qian Cai wrote:
> On Thu, 2020-11-05 at 15:28 -0800, Paul E. McKenney wrote:
> > On Thu, Nov 05, 2020 at 06:02:49PM -0500, Qian Cai wrote:
> > > On Thu, 2020-11-05 at 22:22 +0000, Will Deacon wrote:
> > > > Hmm, this patch has caused a regression in the case that we fail to
> > > > online a CPU because it has incompatible CPU features and so we park it
> > > > in cpu_die_early(). We now get an endless spew of RCU stalls because the
> > > > core will never come online, but is being tracked by RCU. So I'm tempted
> > > > to revert this and live with the lockdep warning while we figure out a
> > > > proper fix.
> > > >
> > > > What's the correct way to undo rcu_cpu_starting(), given that we cannot
> > > > invoke the full hotplug machinery here? Is it correct to call
> > > > rcutree_dying_cpu() on the bad CPU and then rcutree_dead_cpu() from the
> > > > CPU doing cpu_up(), or should we do something else?
> > > It looks to me that rcu_report_dead() does the opposite of
> > > rcu_cpu_starting(), so lift rcu_report_dead() out of CONFIG_HOTPLUG_CPU
> > > and use it there to rewind, Paul?
> >
> > Yes, rcu_report_dead() should do the trick. Presumably the earlier
> > online-time CPU-hotplug notifiers are also unwound?
> I don't think that is an issue here. cpu_die_early() sets CPU_STUCK_IN_KERNEL,
> and then __cpu_up() will see a timeout waiting for the AP to come online and
> then deal with CPU_STUCK_IN_KERNEL accordingly. Thus, something like this? I
> don't see anything in rcu_report_dead() that depends on CONFIG_HOTPLUG_CPU=y.

Cheers both for suggesting rcu_report_dead().

> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 09c96f57818c..10729d2d6084 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -421,6 +421,8 @@ void cpu_die_early(void)
>
>  	update_cpu_boot_status(CPU_STUCK_IN_KERNEL);
>  
> +	rcu_report_dead(cpu);

I think this is in the wrong place, see:

https://lore.kernel.org/r/20201106103602.9849-1-will@xxxxxxxxxx

which seems to fix the problem for me.
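
For anyone following along, here is a rough userspace sketch of why the
stalls show up and why a rewind helps. It is a toy model under stated
assumptions, not kernel code: every toy_* name is made up for illustration,
and the real rcu_cpu_starting()/rcu_report_dead() do considerably more than
flip a bit in a mask. The only point is that a CPU which has been announced
to RCU and then parks will hold up every later grace period unless the
announcement is undone.

```c
#include <stdbool.h>

/* Toy userspace model, NOT kernel code. All toy_* names are hypothetical. */
struct toy_rcu {
	unsigned long tracked;    /* CPUs that have announced themselves */
	unsigned long qs_pending; /* CPUs the current grace period waits on */
};

/* Loosely stands in for rcu_cpu_starting(): begin tracking a CPU. */
static void toy_cpu_starting(struct toy_rcu *r, int cpu)
{
	r->tracked |= 1UL << cpu;
}

/* Loosely stands in for rcu_report_dead(): stop tracking a CPU. */
static void toy_report_dead(struct toy_rcu *r, int cpu)
{
	r->tracked &= ~(1UL << cpu);
	r->qs_pending &= ~(1UL << cpu);
}

/* A new grace period waits on every CPU tracked at the time it starts. */
static void toy_gp_start(struct toy_rcu *r)
{
	r->qs_pending = r->tracked;
}

/* A running CPU reports a quiescent state. */
static void toy_report_qs(struct toy_rcu *r, int cpu)
{
	r->qs_pending &= ~(1UL << cpu);
}

/*
 * CPU 0 boots normally. CPU 1 announces itself, is then found to be
 * incompatible and parks forever. Returns true if a later grace period
 * can complete, false if it would stall waiting on CPU 1.
 */
bool toy_scenario(bool rewind)
{
	struct toy_rcu r = { 0, 0 };

	toy_cpu_starting(&r, 0);
	toy_cpu_starting(&r, 1);
	if (rewind)
		toy_report_dead(&r, 1);	/* undo the tracking before parking */

	toy_gp_start(&r);
	toy_report_qs(&r, 0);		/* only CPU 0 ever runs again */
	return r.qs_pending == 0;
}
```

In this model toy_scenario(false) never completes the grace period, which
corresponds to the endless stall reports, while toy_scenario(true) does.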

Will