Re: [PATCH v3 rcu 3/3] rcu: Finer-grained grace-period-end checks in rcu_dump_cpu_stacks()

From: Paul E. McKenney
Date: Tue Oct 29 2024 - 12:29:22 EST


On Tue, Oct 29, 2024 at 02:20:51AM +0000, Cheng-Jui Wang (王正睿) wrote:
> On Mon, 2024-10-28 at 17:22 -0700, Paul E. McKenney wrote:
> > The result is that the current leaf rcu_node structure's ->lock is
> > acquired only if a stack backtrace might be needed from the current CPU,
> > and is held across only that CPU's backtrace. As a result, if there are
>
> After upgrading our device to kernel-6.11, we encountered a lockup
> scenario under stall warning.
> I had prepared a patch to submit, but I noticed that this series has
> already addressed some issues, though it hasn't been merged into the
> mainline yet. So, I decided to reply to this series for discussion on
> how to fix it before pushing. Here is the lockup scenario We
> encountered:
>
> Devices: arm64 with only 8 cores
> One CPU holds rnp->lock in rcu_dump_cpu_stack() while trying to dump
> other CPUs, but it waits for the corresponding CPU to dump backtrace,
> with a 10-second timeout.
>
> __delay()
> __const_udelay()
> nmi_trigger_cpumask_backtrace()
> arch_trigger_cpumask_backtrace()
> trigger_single_cpu_backtrace()
> dump_cpu_task()
> rcu_dump_cpu_stacks() <- holding rnp->lock
> print_other_cpu_stall()
> check_cpu_stall()
> rcu_pending()
> rcu_sched_clock_irq()
> update_process_times()
>
> However, the other 7 CPUs are waiting for rnp->lock on the path to
> report qs.
>
> queued_spin_lock_slowpath()
> queued_spin_lock()
> do_raw_spin_lock()
> __raw_spin_lock_irqsave()
> _raw_spin_lock_irqsave()
> rcu_report_qs_rdp()
> rcu_check_quiescent_state()
> rcu_core()
> rcu_core_si()
> handle_softirqs()
> __do_softirq()
> ____do_softirq()
> call_on_irq_stack()
>
> Since the arm64 architecture uses IPI instead of true NMI to implement
> arch_trigger_cpumask_backtrace(), spin_lock_irqsave disables
> interrupts, which is enough to block this IPI request.
> Therefore, if other CPUs start waiting for the lock before receiving
> the IPI, a semi-deadlock scenario like the following occurs:
>
> CPU0 CPU1 CPU2
> ----- ----- -----
> lock_irqsave(rnp->lock)
> lock_irqsave(rnp->lock)
> <can't receive IPI>
> <send ipi to CPU 1>
> <wait CPU 1 for 10s>
> lock_irqsave(rnp->lock)
> <can't receive IPI>
> <send ipi to CPU 2>
> <wait CPU 2 for 10s>
> ...
>
>
> In our scenario, with 7 CPUs to dump, the lockup takes nearly 70
> seconds, causing subsequent useful logs to be unable to print, leading
> to a watchdog timeout and system reboot.
>
> This series of changes re-acquires the lock after each dump,
> significantly reducing lock-holding time. However, since it still holds
> the lock while dumping CPU backtrace, there's still a chance for two
> CPUs to wait for each other for 10 seconds, which is still too long.
> So, I would like to ask if it's necessary to dump backtrace within the
> spinlock section?
> If not, especially now that lockless checks are possible, maybe it can
> be changed as follows?
>
> - if (!(data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)))
> - continue;
> - raw_spin_lock_irqsave_rcu_node(rnp, flags);
> - if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
> + if (data_race(rnp->qsmask) & leaf_node_cpu_bit(rnp, cpu)) {
> if (cpu_is_offline(cpu))
> pr_err("Offline CPU %d blocking current GP.\n", cpu);
> else
> dump_cpu_task(cpu);
> }
> }
> - raw_spin_unlock_irqrestore_rcu_node(rnp,
> flags);
>
> Or should this be considered an arm64 issue, and they should switch to
> true NMI, otherwise, they shouldn't use
> nmi_trigger_cpumask_backtrace()?

Thank you for looking into this!

We do assume that nmi_trigger_cpumask_backtrace() uses true NMIs, so,
yes, nmi_trigger_cpumask_backtrace() should use true NMIs, just like
the name says. ;-)

Alternatively, arm64 could continue using nmi_trigger_cpumask_backtrace()
with normal interrupts (for example, on SoCs not implementing true NMIs),
but have a short timeout (maybe a few jiffies?) after which its returns
false (and presumably also cancels the backtrace request so that when
the non-NMI interrupt eventually does happen, its handler simply returns
without backtracing). This should be implemented using atomics to avoid
deadlock issues. This alternative approach would provide accurate arm64
backtraces in the common case where interrupts are enabled, but allow
a graceful fallback to remote tracing otherwise.

Would you be interested in working this issue, whatever solution the
arm64 maintainers end up preferring?

Thanx, Paul