Re: [BUG almost bisected] Splat in dequeue_rt_stack() and build error

From: Paul E. McKenney
Date: Tue Oct 08 2024 - 12:24:20 EST


On Tue, Oct 08, 2024 at 01:11:50PM +0200, Peter Zijlstra wrote:
> On Sun, Oct 06, 2024 at 01:44:53PM -0700, Paul E. McKenney wrote:
>
> > With your patch, I got 24 failures out of 100 TREE03 runs of 18 hours
> > each. The failures were different, though, mostly involving boost
> > failures in which RCU priority boosting didn't actually result in the
> > low-priority readers getting boosted.
>
> Somehow I feel this is progress, albeit very minor :/
>
> > There were also a number of "sched: DL replenish lagged too much"
> > messages, but it looks like this was a symptom of the ftrace dump.
> >
> > Given that this now involves priority boosting, I am trying 400*TREE03
> > with each guest OS restricted to four CPUs to see if that makes things
> > happen more quickly, and will let you know how this goes.

And this does seem to make things happen more quickly, this time
including an RCU splat. So...

> > Any other debug I should apply?
>
> The sched_pi_setprio tracepoint perhaps?

I will give it a shot, thank you!

> I've read all the RCU_BOOST and rtmutex code (once again), and I've been
> running pi_stress with --sched id=low,policy=other to ensure the code
> paths in question are taken. But so far so very nothing :/
>
> (Noting that both RCU_BOOST and PI futexes use the same rt_mutex / PI API)
>
> You know RCU_BOOST better than me.. then again, it is utterly weird this
> is apparently affected. I've gotta ask, a kernel with my patch on and
> additionally flipping kernel/sched/features.h:SCHED_FEAT(DELAY_DEQUEUE,
> false) functions as expected?

I will try that after the sched_pi_setprio tracepoint (presumably with both changes in place).
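
For anyone following along, as far as I know that flip is just this one
entry in kernel/sched/features.h, whose upstream default is true:

	/* Sketch only: disable delayed dequeue for this experiment. */
	SCHED_FEAT(DELAY_DEQUEUE, false)

And if I am remembering the knob correctly, writing NO_DELAY_DEQUEUE to
the sched features file in debugfs should flip it at run time as well.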

> One very minor thing I noticed while I read the code, do with as you
> think best...
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 1c7cbd145d5e..95061119653d 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1071,10 +1071,6 @@ static int rcu_boost(struct rcu_node *rnp)
>  	 * Recheck under the lock: all tasks in need of boosting
>  	 * might exit their RCU read-side critical sections on their own.
>  	 */
> -	if (rnp->exp_tasks == NULL && rnp->boost_tasks == NULL) {
> -		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> -		return 0;
> -	}
>
>  	/*
>  	 * Preferentially boost tasks blocking expedited grace periods.
> @@ -1082,10 +1078,13 @@ static int rcu_boost(struct rcu_node *rnp)
>  	 * expedited grace period must boost all blocked tasks, including
>  	 * those blocking the pre-existing normal grace period.
>  	 */
> -	if (rnp->exp_tasks != NULL)
> -		tb = rnp->exp_tasks;
> -	else
> +	tb = rnp->exp_tasks;
> +	if (!tb)
>  		tb = rnp->boost_tasks;
> +	if (!tb) {
> +		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> +		return 0;
> +	}
>
>  	/*
>  	 * We boost task t by manufacturing an rt_mutex that appears to

Well, it is one line shorter and arguably simpler. It looks equivalent
to me, or am I missing something? If it is equivalent, I will leave the
choice to Frederic and the others, since they will likely have to live
with this code longer than I will.
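
Just to convince myself, I threw together a quick userspace toy (the
names old_pick()/new_pick() and the dummy pointers are of course mine,
standing in for rnp->exp_tasks/rnp->boost_tasks and "unlock and return
0") that checks all four NULL/non-NULL combinations:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Old logic: up-front recheck, then prefer exp_tasks over boost_tasks. */
static const void *old_pick(const void *exp, const void *boost)
{
	if (exp == NULL && boost == NULL)
		return NULL;		/* models "unlock and return 0" */
	return exp != NULL ? exp : boost;
}

/* New logic: pick first, then bail if there is nothing to boost. */
static const void *new_pick(const void *exp, const void *boost)
{
	const void *tb = exp;

	if (!tb)
		tb = boost;
	if (!tb)
		return NULL;		/* models "unlock and return 0" */
	return tb;
}

int main(void)
{
	static const int exp_dummy, boost_dummy;
	int e, b;

	for (e = 0; e <= 1; e++)
		for (b = 0; b <= 1; b++) {
			const void *exp = e ? &exp_dummy : NULL;
			const void *boost = b ? &boost_dummy : NULL;

			assert(old_pick(exp, boost) == new_pick(exp, boost));
		}
	printf("old and new tb selection agree in all four cases\n");
	return 0;
}

Nothing unexpected there, of course; the two really are just a
reordering of the same checks.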

And my next step will be attempting to make rcutorture provoke that RCU
splat more often. In the meantime, please feel free to consider this
to be my bug.

Thanx, Paul