Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

From: Paul E. McKenney
Date: Wed Aug 02 2023 - 13:50:02 EST


On Wed, Aug 02, 2023 at 10:14:51AM -0700, Linus Torvalds wrote:
> Two quick comments, both of them "this code is a bit odd" rather than
> anything else.

Good to get eyes on this code, so thank you very much!!!

> On Tue, 1 Aug 2023 at 12:11, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
>
> Why is this file called "tasks.h"?
>
> It's not a header file. It makes no sense. It's full of C code. It's
> included in only one place. It's just _weird_.

You are right, it is weird.

This is a holdover from when I was much more concerned about being
criticized for having #ifdef in a .c file, and pretty much every line
in this file is under one combination or another of #ifdefs.
That concern led to kernel/rcu/tree_plugin.h being set up in this way
back when preemptible RCU was introduced, and for better or worse I just
kept following that pattern.

We could convert this to a .c file, keep the #ifdefs, drop some instances
of "static", add a bunch of declarations, and maybe (or maybe not) push a
function or two into some .h file for performance/inlining reasons. Me, I
would prefer to leave it alone, but we can certainly change it.
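
For anyone reading along, the pattern amounts to this (a sketch from
memory, so take the exact guard name with a grain of salt):

	/* kernel/rcu/update.c: the one and only include site. */
	#ifdef CONFIG_TASKS_RCU_GENERIC
	#include "tasks.h"	/* C code, nearly all of it under #ifdef. */
	#endif /* #ifdef CONFIG_TASKS_RCU_GENERIC */

So the "header" is really a conditionally compiled chunk of update.c.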

> However, more relevantly:
>
> > + mutex_unlock(&rtp->tasks_gp_mutex);
> > set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
>
> Isn't the tasks_gp_mutex the thing that protects the gp state here?
> Shouldn't it be after setting?

Much of the gp state is protected by being accessed only by the gp
kthread. But there is a window in time, before the gp kthread exists,
during which the gp might be driven directly out of the
synchronize_rcu_tasks() call. That window does not have a definite end,
so this ->tasks_gp_mutex provides the needed mutual exclusion while gp
processing is handed off to the newly created gp kthread.
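
Roughly speaking, the idea is this (a sketch, not the exact tasks.h
code, and the function name is illustrative):

	/* Both an early-boot synchronize_rcu_tasks() caller and the
	 * newly spawned gp kthread funnel through this function, so the
	 * mutex covers the otherwise-indefinite handoff window. */
	static void drive_one_tasks_gp(struct rcu_tasks *rtp)
	{
		mutex_lock(&rtp->tasks_gp_mutex);
		/* ... wait for callbacks, then drive one grace period ... */
		mutex_unlock(&rtp->tasks_gp_mutex);
	}

Once the kthread is the only caller, the mutex is uncontended and the
kthread-only rule protects the rest of the gp state.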

> > rcuwait_wait_event(&rtp->cbs_wait,
> > (needgpcb = rcu_tasks_need_gpcb(rtp)),
> > TASK_IDLE);
>
> Also, looking at rcu_tasks_need_gpcb() that is now called outside the
> lock, it does something quite odd.

The state of each callback list is protected by the ->lock field of
the rcu_tasks_percpu structure. Yes, rcu_segcblist_n_cbs() is invoked
in rcu_tasks_need_gpcb() outside of that lock, but it is designed for
lockless use. If the list is modified just after the check, the race
is harmless: either there will be a later wakeup on the one hand, or
we will just uselessly acquire that ->lock this one time on the other.
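
Schematically, the check-then-lock idiom in that loop looks like this
(condensed; see rcu_tasks_need_gpcb() for the real thing):

	if (!rcu_segcblist_n_cbs(&rtpcp->cblist))
		continue;	/* Racy, but a miss means only a later wakeup. */
	raw_spin_lock_irqsave_rcu_node(rtpcp, flags);
	/* The callbacks themselves are manipulated only under ->lock. */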

Also, ncbs records the number of callbacks seen in that first loop and
is used later, by which time its value might be stale. This staleness
might result in a collapse back to single-callback-queue operation and
a later expansion back up. Except that at this point we are still in
single-CPU mode, so there should not be any lock contention, which means
that there should still be but a single callback queue. The transition
itself is protected by ->cbs_gbl_lock.
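
In sketch form, with the threshold name made up for illustration:

	/* Collapse back to a single callback queue when load stays low.
	 * Even if ncbs is stale, the transition is atomic under the
	 * global lock, so the worst case is an extra shrink/grow cycle. */
	if (ncbs <= collapse_threshold) {	/* hypothetical name */
		raw_spin_lock_irqsave(&rtp->cbs_gbl_lock, flags);
		if (rtp->percpu_enqueue_lim > 1)
			smp_store_release(&rtp->percpu_enqueue_lim, 1);
		raw_spin_unlock_irqrestore(&rtp->cbs_gbl_lock, flags);
	}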

> At the very top of the function does
>
> for (cpu = 0; cpu < smp_load_acquire(&rtp->percpu_dequeue_lim); cpu++) {
>
> and 'smp_load_acquire()' is all about saying "everything *after* this
> load is ordered".
>
> But the way it is done in that loop, it is indeed done at the
> beginning of the loop, but then it's done *after* the loop too, so the
> last smp_load_acquire seems a bit nonsensical.
>
> If you want to load a value and say "this value is now sensible for
> everything that follows", I think you should load it *first*. No?
>
> IOW, wouldn't the whole sequence make more sense as
>
> dequeue_limit = smp_load_acquire(&rtp->percpu_dequeue_lim);
> for (cpu = 0; cpu < dequeue_limit; cpu++) {
>
> and say that everything in rcu_tasks_need_gpcb() is ordered wrt the
> initial limit on entry?
>
> I dunno. That use of "smp_load_acquire()" just seems odd. Memory
> ordering is hard to understand to begin with, but then when you have
> things like loops that do the same ordered load multiple times, it
> goes from "hard to understand" to positively confusing.

Excellent point. I am queueing that change with your Suggested-by.
If testing goes well, it will be as shown below.

Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 83049a893de5..94bb5abdbb37 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -432,6 +432,7 @@ static void rcu_barrier_tasks_generic(struct rcu_tasks *rtp)
static int rcu_tasks_need_gpcb(struct rcu_tasks *rtp)
{
int cpu;
+ int dequeue_limit;
unsigned long flags;
bool gpdone = poll_state_synchronize_rcu(rtp->percpu_dequeue_gpseq);
long n;
@@ -439,7 +440,8 @@ static int rcu_tasks_need_gpcb(struct rcu_tasks *rtp)
long ncbsnz = 0;
int needgpcb = 0;

- for (cpu = 0; cpu < smp_load_acquire(&rtp->percpu_dequeue_lim); cpu++) {
+ dequeue_limit = smp_load_acquire(&rtp->percpu_dequeue_lim);
+ for (cpu = 0; cpu < dequeue_limit; cpu++) {
struct rcu_tasks_percpu *rtpcp = per_cpu_ptr(rtp->rtpcpu, cpu);

/* Advance and accelerate any new callbacks. */