Re: RCU vs NOHZ
From: Paul E. McKenney
Date: Wed Sep 21 2022 - 17:36:52 EST
On Sat, Sep 17, 2022 at 07:25:08AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 16, 2022 at 11:20:14AM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 16, 2022 at 12:58:17AM -0700, Paul E. McKenney wrote:
> >
> > > To the best of my knowledge at this point in time, agreed. Who knows
> > > what someone will come up with next week? But for people running certain
> > > types of real-time and HPC workloads, context tracking really does handle
> > > both idle and userspace transitions.
> >
> > Sure, but idle != nohz. Nohz is where we disable the tick, and currently
> > RCU can inhibit this -- rcu_needs_cpu().
>
> Exactly. For non-nohz userspace execution, the tick is still running
> anyway, so RCU of course won't be inhibiting its disabling. And in that
> case, RCU's hook is the tick interrupt itself. RCU's hook is passed a
> flag saying whether the interrupt came from userspace or from kernel.
>
> > AFAICT there really isn't an RCU hook for this, not through context
> > tracking not through anything else.
>
> There is a directly invoked RCU hook for any transition that enables or
> disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
> that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
>
> > > It wasn't enabled for ChromeOS.
> > >
> > > When fully enabled, it gave them the energy-efficiency advantages Joel
> > > described. And then Joel described some additional call_rcu_lazy()
> > > changes that provided even better energy efficiency. Though I believe
> > > that the application should also be changed to avoid incessantly opening
> > > and closing that file while the device is idle, as this would remove
> > > -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
> > > use cases would likely remain.
> >
> > So I'm thinking the scheme I outlined gets you most if not all of what
> > lazy would get you without having to add the lazy thing. A CPU is never
> > refused deep idle when it passes off the callbacks.
> >
> > The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
> > and do our utmost bestest to move work away from it. You *want* to break
> > affinity at this point.
> >
> > If you hate on the global, push it to a per rcu_node offload list until
> > the whole node is idle and then push it up the next rcu_node level until
> > you reach the top.
> >
> > Then when the top rcu_node is full idle; you can insta progress the QS
> > state and run the callbacks and go idle.
>
> Unfortunately, the overhead of doing all that tracking along with
> resolving all the resulting race conditions will -increase- power
> consumption. With RCU, counting CPU wakeups is not as good a predictor
> of power consumption as one might like. Sure, it is a nice heuristic
> in some cases, but with RCU it is absolutely -not- a replacement for
> actually measuring power consumption on real hardware. And yes, I did
> learn this the hard way. Why do you ask? ;-)
>
> And that is why the recently removed CONFIG_RCU_FAST_NO_HZ left the
> callbacks in place and substituted a 4x slower timer for the tick.
> -That- actually resulted in significant real measured power savings on
> real hardware.
>
> Except that everything that was building with CONFIG_RCU_FAST_NO_HZ
> was also doing nohz_full on each and every CPU. Which meant that all
> that CONFIG_RCU_FAST_NO_HZ was doing for them was adding an additional
> useless check on each transition to and from idle. Which in turn is why
> CONFIG_RCU_FAST_NO_HZ was removed. No one was using it in any way that
> made any sense.
>
> And more recent testing with rcu_nocbs on both ChromeOS and Android has
> produced better savings than was produced by CONFIG_RCU_FAST_NO_HZ anyway.
>
> Much of the additional savings from Joel et al.'s work is not so much
> from reducing the number of ticks, but rather from reducing the number
> of grace periods, which are of course much heavier weight.
>
> And this of course means that any additional schemes to reduce RCU's
> power consumption must be compared (with real measurements on real
> hardware!) to Joel et al.'s work, whether in combination or as an
> alternative. And either way, the power savings must of course justify
> the added code and complexity.

And here is an untested patch that in theory might allow much of the
reduction in power with minimal complexity/overhead for kernels without
rcu_nocbs CPUs.  I am posting it on the off-chance that you know of
someone who would be willing to do a realistic evaluation of it.
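
In case it helps anyone reason about the arithmetic before testing, the
decision the patch below makes can be modeled in user space roughly as
follows.  This is only a sketch under stated assumptions: the model_*
names, the struct, and MODEL_TICK_NSEC (which assumes HZ=250) are made up
for illustration and are not kernel code, and the two FQS delays are passed
in as plain parameters rather than read from jiffies_till_first_fqs and
jiffies_till_next_fqs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only, not kernel definitions. */
#define MODEL_TICK_NSEC 4000000ULL	/* one tick in ns, assuming HZ=250 */
#define MODEL_NEVER     UINT64_MAX	/* plays the role of KTIME_MAX */

/* Simplified view of one CPU's segmented callback list. */
struct model_cblist {
	bool enabled;
	bool offloaded;
	unsigned int n_done;	/* ready to invoke (RCU_DONE_TAIL) */
	unsigned int n_wait;	/* grace period already assigned */
	unsigned int n_next;	/* no grace period assigned yet (RCU_NEXT_TAIL) */
};

/*
 * Returns 1 if RCU needs the very next tick, otherwise 0 with *nextevt
 * set to the time at which RCU would next like to run.  Unlike the patch,
 * this model always initializes *nextevt so the result is defined on
 * every path.
 */
static int model_rcu_needs_cpu(const struct model_cblist *cbl,
			       uint64_t basemono, uint64_t first_fqs,
			       uint64_t next_fqs, uint64_t *nextevt)
{
	assert(nextevt != NULL);
	*nextevt = MODEL_NEVER;

	/* Disabled, empty, or offloaded means nothing to do. */
	if (!cbl->enabled || cbl->offloaded ||
	    cbl->n_done + cbl->n_wait + cbl->n_next == 0)
		return 0;

	/* Ready-to-invoke or not-yet-assigned callbacks need attention now. */
	if (cbl->n_done || cbl->n_next)
		return 1;

	/* Everything is waiting on a grace period, so defer the tick. */
	*nextevt = basemono + MODEL_TICK_NSEC * (first_fqs + next_fqs + 1);
	return 0;
}

/* Combine RCU's wish with the next ordinary timer, as the tick code does. */
static uint64_t model_next_tick(const struct model_cblist *cbl,
				uint64_t basemono, uint64_t first_fqs,
				uint64_t next_fqs, uint64_t next_tmr)
{
	uint64_t next_rcu;

	if (model_rcu_needs_cpu(cbl, basemono, first_fqs, next_fqs, &next_rcu))
		return basemono + MODEL_TICK_NSEC;
	return next_rcu < next_tmr ? next_rcu : next_tmr;
}
```

With both delays at 3 jiffies, a CPU whose callbacks are all waiting on a
grace period would be asked to come back after 7 ticks (28 ms at HZ=250)
rather than after 1, unless an ordinary timer expires sooner.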
Thanx, Paul
------------------------------------------------------------------------
commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
Date: Wed Sep 21 13:30:24 2022 -0700

    rcu: Let non-offloaded idle CPUs with callbacks defer tick

    When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
    not RCU needs the scheduler-clock tick to keep interrupting.  Right now,
    RCU keeps the tick on for a given idle CPU if there are any non-offloaded
    callbacks queued on that CPU.

    But if all of these callbacks are waiting for a grace period to finish,
    there is no point in scheduling a tick before that grace period has any
    reasonable chance of completing.  This commit therefore delays the tick
    in the case where all the callbacks are waiting for a specific grace
    period to elapse.  In theory, this should result in a 50-70% reduction in
    RCU-induced scheduling-clock ticks on mostly-idle CPUs.  In practice, TBD.

    Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
    Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 9bc025aa79a3..84e930c11065 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
rcu_tasks_qs(current, (preempt)); \
} while (0)
-static inline int rcu_needs_cpu(void)
+static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
{
return 0;
}
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 70795386b9ff..3066e0975022 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -19,7 +19,7 @@
void rcu_softirq_qs(void);
void rcu_note_context_switch(bool preempt);
-int rcu_needs_cpu(void);
+int rcu_needs_cpu(u64 basemono, u64 *nextevt);
void rcu_cpu_stall_reset(void);
/*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5ec97e3f7468..47cd3b0d2a07 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
* scheduler-clock interrupt.
*
* Just check whether or not this CPU has non-offloaded RCU callbacks
- * queued.
+ * queued that need immediate attention.
*/
-int rcu_needs_cpu(void)
+int rcu_needs_cpu(u64 basemono, u64 *nextevt)
{
- return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
- !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
+ struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+ struct rcu_segcblist *rsclp = &rdp->cblist;
+
+ // Disabled, empty, or offloaded means nothing to do.
+ if (!rcu_segcblist_is_enabled(rsclp) ||
+ rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
+ *nextevt = KTIME_MAX;
+ return 0;
+ }
+
+ // Callbacks ready to invoke or that have not already been
+ // assigned a grace period need immediate attention.
+ if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
+ !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
+ return 1;
+
+ // There are callbacks waiting for some later grace period.
+ // Wait for about a grace period or two for the next tick, at which
+ // point there is high probability that this CPU will need to do some
+ // work for RCU.
+ *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
+ READ_ONCE(jiffies_till_next_fqs) + 1);
+ return 0;
}
/*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b0e3c9205946..303ea15cdb96 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -784,7 +784,7 @@ static inline bool local_timer_softirq_pending(void)
static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
{
- u64 basemono, next_tick, delta, expires;
+ u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
unsigned long basejiff;
unsigned int seq;
@@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
* minimal delta which brings us back to this place
* immediately. Lather, rinse and repeat...
*/
- if (rcu_needs_cpu() || arch_needs_cpu() ||
+ if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
irq_work_needs_cpu() || local_timer_softirq_pending()) {
next_tick = basemono + TICK_NSEC;
} else {
@@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
* disabled this also looks at the next expiring
* hrtimer.
*/
- next_tick = get_next_timer_interrupt(basejiff, basemono);
- ts->next_timer = next_tick;
+ next_tmr = get_next_timer_interrupt(basejiff, basemono);
+ ts->next_timer = next_tmr;
+ /* Take the next rcu event into account */
+ next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
}
/*