Re: RCU vs NOHZ

From: Paul E. McKenney
Date: Fri Sep 23 2022 - 14:15:43 EST


On Fri, Sep 23, 2022 at 01:47:40PM -0400, Joel Fernandes wrote:
> On Fri, Sep 23, 2022 at 12:01 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> [...]
> > > > And here is an untested patch that in theory might allow much of the
> > > > reduction in power with minimal complexity/overhead for kernels without
> > > > rcu_nocbs CPUs, on the off-chance that you know of someone who would be
> > > > willing to do a realistic evaluation of it.
> > > >
> > > > Thanx, Paul
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
> > > > Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > > Date: Wed Sep 21 13:30:24 2022 -0700
> > > >
> > > > rcu: Let non-offloaded idle CPUs with callbacks defer tick
> > > >
> > > > When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
> > > > not RCU needs the scheduler-clock tick to keep interrupting. Right now,
> > > > RCU keeps the tick on for a given idle CPU if there are any non-offloaded
> > > > callbacks queued on that CPU.
> > > >
> > > > But if all of these callbacks are waiting for a grace period to finish,
> > > > there is no point in scheduling a tick before that grace period has any
> > > > reasonable chance of completing. This commit therefore delays the tick
> > > > in the case where all the callbacks are waiting for a specific grace
> > > > period to elapse. In theory, this should result in a 50-70% reduction in
> > > > RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.
> > > >
> > > > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > >
> > > > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > > > index 9bc025aa79a3..84e930c11065 100644
> > > > --- a/include/linux/rcutiny.h
> > > > +++ b/include/linux/rcutiny.h
> > > > @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
> > > > rcu_tasks_qs(current, (preempt)); \
> > > > } while (0)
> > > >
> > > > -static inline int rcu_needs_cpu(void)
> > > > +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> > > > {
> > > > return 0;
> > > > }
> > > > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> > > > index 70795386b9ff..3066e0975022 100644
> > > > --- a/include/linux/rcutree.h
> > > > +++ b/include/linux/rcutree.h
> > > > @@ -19,7 +19,7 @@
> > > >
> > > > void rcu_softirq_qs(void);
> > > > void rcu_note_context_switch(bool preempt);
> > > > -int rcu_needs_cpu(void);
> > > > +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
> > > > void rcu_cpu_stall_reset(void);
> > > >
> > > > /*
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 5ec97e3f7468..47cd3b0d2a07 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
> > > > * scheduler-clock interrupt.
> > > > *
> > > > * Just check whether or not this CPU has non-offloaded RCU callbacks
> > > > - * queued.
> > > > + * queued that need immediate attention.
> > > > */
> > > > -int rcu_needs_cpu(void)
> > > > +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> > > > {
> > > > - return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
> > > > - !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
> > > > + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > > > + struct rcu_segcblist *rsclp = &rdp->cblist;
> > > > +
> > > > + // Disabled, empty, or offloaded means nothing to do.
> > > > + if (!rcu_segcblist_is_enabled(rsclp) ||
> > > > + rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
> > > > + *nextevt = KTIME_MAX;
> > > > + return 0;
> > > > + }
> > > > +
> > > > + // Callbacks ready to invoke or that have not already been
> > > > + // assigned a grace period need immediate attention.
> > > > + if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
> > > > + !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
> > > > + return 1;
> > > > +
> > > > + // There are callbacks waiting for some later grace period.
> > > > + // Wait for about a grace period or two for the next tick, at which
> > > > + // point there is high probability that this CPU will need to do some
> > > > + // work for RCU.
> > > > + *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
> > > > + READ_ONCE(jiffies_till_next_fqs) + 1);
> > > > + return 0;
> > > > +}
> > >
> > > Looks like a nice idea. Could this race with the main GP thread on another
> > > CPU completing the grace period, so that there is actually some work to do
> > > on this CPU but rcu_needs_cpu() returns 0?
> > >
> > > I think it is plausible but not common, in which case the extra delay is
> > > probably OK.
> >
> > Glad you like it!
> >
> > Yes, that race can happen, but it can also happen today.
> > A scheduling-clock interrupt might arrive at a CPU just as a grace
> > period finishes. Yes, the delay is longer with this patch. If this
> > proves to be a problem, then the delay heuristic might be expanded to
> > include the age of the current grace period.
> >
> > But keeping it simple to start with.
>
> Sure, sounds good, and yes, I agree with your point about the existing
> issue, but the error there is just one jiffy, as you pointed out.

One jiffy currently, but it would typically be about seven jiffies with
the patch. Systems with smaller values of HZ would have fewer jiffies,
and systems with more than 128 CPUs would have more jiffies. Systems
booted with explicit values for the rcutree.jiffies_till_first_fqs and
rcutree.jiffies_till_next_fqs kernel boot parameters could have whatever
the administrator wanted. ;-)
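
To make that arithmetic concrete, here is a back-of-the-envelope
sketch (not code from the patch), assuming HZ=1000, fewer than
RCU_JIFFIES_FQS_DIV CPUs, and default boot parameters, in which case
rcu_init_geometry() sets both fqs intervals to
RCU_JIFFIES_TILL_FORCE_QS = 1 + (HZ > 250) + (HZ > 500) = 3 jiffies:

    unsigned long defer = 3 /* jiffies_till_first_fqs */ +
                          3 /* jiffies_till_next_fqs */ + 1;  /* = 7 */
    u64 nextevt = basemono + TICK_NSEC * defer;  /* ~7 ms at HZ=1000 */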

But the key point is that the grace period itself can be extended by
that value just due to timing and distribution of idle CPUs.
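
If the extra delay ever does prove to be a problem, expanding the
heuristic to include the age of the current grace period might look
something like the following untested sketch. Here rcu_state.gp_start
is real, but the rest is illustration only, ignoring counter wrap,
locking, and the case where no grace period is in progress:

    static u64 rcu_tick_deferral(u64 basemono)
    {
        unsigned long defer = READ_ONCE(jiffies_till_first_fqs) +
                              READ_ONCE(jiffies_till_next_fqs) + 1;
        unsigned long gpage = jiffies - READ_ONCE(rcu_state.gp_start);

        if (gpage < defer)
            defer -= gpage;  /* Credit time the GP has already run. */
        else
            defer = 1;       /* GP overdue: recheck at the next tick. */
        return basemono + TICK_NSEC * defer;
    }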

> > > Also, if the RCU readers take a long time, then we'd still wake up the system
> > > periodically, although with the above change far fewer times, which is a good
> > > thing.
> >
> > And the delay heuristic could also be expanded to include a digitally
> > filtered estimate of grace-period duration. But again, keeping it simple
> > to start with. ;-)
> >
> > My guess is that offloading gets you more power savings, but I don't
> > have a good way of testing this guess.
>
> I could try to run turbostat on Monday on our Intel SoCs, and see how
> it reacts, but I was thinking of tracing this first to see the
> behavior. Another thing I was thinking of was updating (the future)
> rcutop to see how many 'idle ticks' are RCU-related vs. others, and
> then seeing how this patch affects that.

Such testing would be very welcome, thank you!

This patch might also need to keep track of the last tick on a given
CPU in order to prevent frequent short idle periods from indefinitely
delaying the tick.
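
Something like the following untested sketch could cap the total
deferral, where the per-CPU ->last_sched_tick timestamp is purely
hypothetical and would be recorded (in nanoseconds) by the
scheduler-clock interrupt:

    u64 defer = TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
                             READ_ONCE(jiffies_till_next_fqs) + 1);
    /* Hypothetical field: never push the next event more than one
     * full interval past the last tick this CPU actually took. */
    u64 cap = READ_ONCE(rdp->last_sched_tick) + defer;

    *nextevt = min(basemono + defer, cap);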

Thanx, Paul

> thanks,
>
> - Joel
>
>
> > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > > --- a/kernel/time/tick-sched.c
> > > > +++ b/kernel/time/tick-sched.c
> > > > @@ ... @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> > > > {
> > > > - u64 basemono, next_tick, delta, expires;
> > > > + u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
> > > > unsigned long basejiff;
> > > > unsigned int seq;
> > > > @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> > > > * minimal delta which brings us back to this place
> > > > * immediately. Lather, rinse and repeat...
> > > > */
> > > > - if (rcu_needs_cpu() || arch_needs_cpu() ||
> > > > + if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
> > > > irq_work_needs_cpu() || local_timer_softirq_pending()) {
> > > > next_tick = basemono + TICK_NSEC;
> > > > } else {
> > > > @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> > > > * disabled this also looks at the next expiring
> > > > * hrtimer.
> > > > */
> > > > - next_tick = get_next_timer_interrupt(basejiff, basemono);
> > > > - ts->next_timer = next_tick;
> > > > + next_tmr = get_next_timer_interrupt(basejiff, basemono);
> > > > + ts->next_timer = next_tmr;
> > > > + /* Take the next rcu event into account */
> > > > + next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> > > > }
> > > >
> > > > /*