Re: [PATCH v2] arm64: smp: smp_send_stop() and crash_smp_send_stop() should try non-NMI first

From: Yu Zhao
Date: Mon Aug 05 2024 - 13:26:57 EST


On Mon, Aug 5, 2024 at 11:13 AM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> On Mon, Aug 5, 2024 at 9:53 AM Will Deacon <will@xxxxxxxxxx> wrote:
> >
> > Hi Doug,
> >
> > On Tue, Jun 25, 2024 at 04:07:22PM -0700, Douglas Anderson wrote:
> > > @@ -1084,79 +1088,87 @@ static inline unsigned int num_other_online_cpus(void)
> > >
> > > void smp_send_stop(void)
> > > {
> > > + static unsigned long stop_in_progress;
> > > + cpumask_t mask;
> > > unsigned long timeout;
> > >
> > > - if (num_other_online_cpus()) {
> > > - cpumask_t mask;
> > > + /*
> > > + * If this cpu is the only one alive at this point in time, online or
> > > + * not, there are no stop messages to be sent around, so just back out.
> > > + */
> > > + if (num_other_online_cpus() == 0)
> > > + goto skip_ipi;
> > >
> > > - cpumask_copy(&mask, cpu_online_mask);
> > > - cpumask_clear_cpu(smp_processor_id(), &mask);
> > > + /* Only proceed if this is the first CPU to reach this code */
> > > + if (test_and_set_bit(0, &stop_in_progress))
> > > + return;
> > >
> > > - if (system_state <= SYSTEM_RUNNING)
> > > - pr_crit("SMP: stopping secondary CPUs\n");
> > > - smp_cross_call(&mask, IPI_CPU_STOP);
> > > - }
> > > + cpumask_copy(&mask, cpu_online_mask);
> > > + cpumask_clear_cpu(smp_processor_id(), &mask);
> > >
> > > - /* Wait up to one second for other CPUs to stop */
> > > + if (system_state <= SYSTEM_RUNNING)
> > > + pr_crit("SMP: stopping secondary CPUs\n");
> > > +
> > > + /*
> > > + * Start with a normal IPI and wait up to one second for other CPUs to
> > > + * stop. We do this first because it gives other processors a chance
> > > + * to exit critical sections / drop locks and makes the rest of the
> > > + * stop process (especially console flush) more robust.
> > > + */
> > > + smp_cross_call(&mask, IPI_CPU_STOP);
> >
> > I realise you've moved this out of crash_smp_send_stop() and it looks
> > like we inherited the code from x86, but do you know how this serialise
> > against CPU hotplug operations? I've spent the last 20m looking at the
> > code and I can't see what prevents other CPUs from coming and going
> > while we're trying to IPI a non-atomic copy of 'cpu_online_mask'.
>
> I don't think there is anything. ...and it's not just this code
> either.

I agree -- I noticed this a while ago [1], and I added
cpus_read_lock/unlock() on the caller side, because having the locks
inside callees can be a problem for callers who can't sleep.

[1] https://lore.kernel.org/linux-mm/ZnkGps-vQbiynNwP@xxxxxxxxxx/

Also, do the handlers always see `crash_stop` set by the sender? If
not, would it be a problem? (In the above link, it has to do extra
work to make sure that handlers eventually see any updated values.)




> It sure looks like nmi_trigger_cpumask_backtrace() has the
> same problem.
>
> I guess maybe in the case of nmi_trigger_cpumask_backtrace() it's not
> such a big deal because:
> 1. If a CPU goes away then we'll just time out
> 2. If a CPU shows up then we'll skip backtracing it, but we were
> sending backtraces at an instant in time anyway.
>
> In the case of smp_send_stop() it's probably fine if a CPU goes away
> because, again, we'll just timeout. ...but if a CPU shows up then
> that's not super ideal. Maybe it doesn't cause problems in practice
> but it does feel like it should be fixed.
>
> -Doug