[PATCH AUTOSEL 6.1 7/8] rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocation

From: Sasha Levin
Date: Sat Jul 27 2024 - 20:55:21 EST


From: Frederic Weisbecker <frederic@xxxxxxxxxx>

[ Upstream commit 55d4669ef1b76823083caecfab12a8bd2ccdcf64 ]

When rcu_barrier() calls rcu_rdp_cpu_online() and observes a CPU off
rnp->qsmaskinitnext, it means that all accesses from the offline CPU
preceding the CPUHP_TEARDOWN_CPU are visible to RCU barrier, including
callbacks expiration and counter updates.

However interrupts can still fire after stop_machine() re-enables
interrupts and before rcutree_report_cpu_dead(). The related accesses
happening between CPUHP_TEARDOWN_CPU and rnp->qsmaskinitnext clearing
are _NOT_ guaranteed to be seen by rcu_barrier() without proper
ordering, especially when callbacks are invoked there to the end, making
rcutree_migrate_callback() bypass barrier_lock.

The following theoretical race example can make rcu_barrier() hang:

CPU 0 CPU 1
----- -----
//cpu_down()
smpboot_park_threads()
//ksoftirqd is parked now
<IRQ>
rcu_sched_clock_irq()
invoke_rcu_core()
do_softirq()
rcu_core()
rcu_do_batch()
// callback storm
// rcu_do_batch() returns
// before completing all
// of them
// do_softirq also returns early because of
// timeout. It defers to ksoftirqd but
// it's parked
</IRQ>
stop_machine()
take_cpu_down()
rcu_barrier()
spin_lock(barrier_lock)
// observes rcu_segcblist_n_cbs(&rdp->cblist) != 0
<IRQ>
do_softirq()
rcu_core()
rcu_do_batch()
//completes all pending callbacks
//smp_mb() implied _after_ callback number dec
</IRQ>

rcutree_report_cpu_dead()
rnp->qsmaskinitnext &= ~rdp->grpmask;

rcutree_migrate_callback()
// no callback, early return without locking
// barrier_lock
//observes !rcu_rdp_cpu_online(rdp)
rcu_barrier_entrain()
rcu_segcblist_entrain()
// Observe rcu_segcblist_n_cbs(rsclp) == 0
// because no barrier between reading
// rnp->qsmaskinitnext and rsclp->len
rcu_segcblist_add_len()
smp_mb__before_atomic()
// will now observe the 0 count and empty
// list, but too late, we enqueue regardless
WRITE_ONCE(rsclp->len, rsclp->len + v);
// ignored barrier callback
// rcu barrier stall...

This could be solved with a read memory barrier, enforcing the message
passing between rnp->qsmaskinitnext and rsclp->len, matching the full
memory barrier after rsclp->len addition in rcu_segcblist_add_len()
performed at the end of rcu_do_batch().

However the rcu_barrier() is complicated enough and probably doesn't
need too many more subtleties. CPU down is a slowpath and the
barrier_lock seldom contended. Solve the issue with unconditionally
locking the barrier_lock on rcutree_migrate_callbacks(). This makes sure
that either rcu_barrier() sees the empty queue or its entrained
callback will be migrated.

Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
---
kernel/rcu/tree.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 61f9503a5fe9c..cd6144cea5a1a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4391,11 +4391,15 @@ void rcutree_migrate_callbacks(int cpu)
struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
bool needwake;

- if (rcu_rdp_is_offloaded(rdp) ||
- rcu_segcblist_empty(&rdp->cblist))
- return; /* No callbacks to migrate. */
+ if (rcu_rdp_is_offloaded(rdp))
+ return;

raw_spin_lock_irqsave(&rcu_state.barrier_lock, flags);
+ if (rcu_segcblist_empty(&rdp->cblist)) {
+ raw_spin_unlock_irqrestore(&rcu_state.barrier_lock, flags);
+ return; /* No callbacks to migrate. */
+ }
+
WARN_ON_ONCE(rcu_rdp_cpu_online(rdp));
rcu_barrier_entrain(rdp);
my_rdp = this_cpu_ptr(&rcu_data);
--
2.43.0