Re: RCU vs NOHZ

From: Paul E. McKenney
Date: Sat Sep 17 2022 - 10:25:36 EST


On Fri, Sep 16, 2022 at 11:20:14AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 16, 2022 at 12:58:17AM -0700, Paul E. McKenney wrote:
>
> > To the best of my knowledge at this point in time, agreed. Who knows
> > what someone will come up with next week? But for people running certain
> > types of real-time and HPC workloads, context tracking really does handle
> > both idle and userspace transitions.
>
> Sure, but idle != nohz. Nohz is where we disable the tick, and currently
> RCU can inhibit this -- rcu_needs_cpu().

Exactly. For non-nohz userspace execution, the tick is still running
anyway, so RCU of course won't be inhibiting its disabling. In that
case, RCU's hook is the tick interrupt itself, which passes RCU a flag
saying whether the interrupt came from userspace or from the kernel.
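
For reference, here is a rough sketch of how that tick-path hook is
wired up (simplified and from memory; exact names and placement vary
by kernel version):

	/* Simplified sketch of the tick-interrupt path into RCU. */
	void update_process_times(int user_tick)
	{
		/* ... time accounting, scheduler tick, and so on ... */

		/*
		 * Tell RCU whether this tick interrupted userspace or
		 * the kernel, so that it can note (or not note) a
		 * quiescent state accordingly.
		 */
		rcu_sched_clock_irq(user_tick);
	}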

> AFAICT there really isn't an RCU hook for this, not through context
> tracking not through anything else.

There is a directly invoked RCU hook for any transition that enables or
disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
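
On the idle path, the placement is roughly as follows (heavily
simplified sketch, not the actual idle-loop code; ct_user_enter() and
ct_user_exit() play the analogous role for nohz_full userspace
transitions):

	/* Heavily simplified idle-loop sketch showing RCU's hooks. */
	static void idle_loop_sketch(void)
	{
		while (!need_resched()) {
			ct_idle_enter();  /* was rcu_idle_enter(): RCU stops watching this CPU */
			arch_cpu_idle();  /* low-power wait for the next interrupt */
			ct_idle_exit();   /* was rcu_idle_exit(): RCU watches this CPU again */
		}
	}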

> > It wasn't enabled for ChromeOS.
> >
> > When fully enabled, it gave them the energy-efficiency advantages Joel
> > described. And then Joel described some additional call_rcu_lazy()
> > changes that provided even better energy efficiency. Though I believe
> > that the application should also be changed to avoid incessantly opening
> > and closing that file while the device is idle, as this would remove
> > -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
> > use cases would likely remain.
>
> So I'm thinking the scheme I outlined gets you most if not all of what
> lazy would get you without having to add the lazy thing. A CPU is never
> refused deep idle when it passes off the callbacks.
>
> The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
> and do our utmost bestest to move work away from it. You *want* to break
> affinity at this point.
>
> If you hate on the global, push it to a per rcu_node offload list until
> the whole node is idle and then push it up the next rcu_node level until
> you reach the top.
>
> Then when the top rcu_node is full idle; you can insta progress the QS
> state and run the callbacks and go idle.

Unfortunately, the overhead of doing all that tracking along with
resolving all the resulting race conditions will -increase- power
consumption. With RCU, counting CPU wakeups is not as good a predictor
of power consumption as one might like. Sure, it is a nice heuristic
in some cases, but with RCU it is absolutely -not- a replacement for
actually measuring power consumption on real hardware. And yes, I did
learn this the hard way. Why do you ask? ;-)

And that is why the recently removed CONFIG_RCU_FAST_NO_HZ left the
callbacks in place and substituted a 4x slower timer for the tick.
-That- actually resulted in significant real measured power savings on
real hardware.
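
To illustrate the idea (a from-memory sketch of the approach, not the
removed code itself; cpu_has_rcu_callbacks() and rcu_idle_timer stand
in for the real machinery):

	#define RCU_IDLE_GP_DELAY	4	/* jiffies, about 4x the tick period */

	static struct timer_list rcu_idle_timer;	/* illustrative only */
	static bool cpu_has_rcu_callbacks(void);	/* stand-in predicate */

	/*
	 * On entry to dyntick-idle: if callbacks are pending, let the
	 * tick stop anyway, but arm a timer roughly 4x slower than the
	 * tick so that those callbacks still get processed.
	 */
	static void rcu_prepare_for_idle_sketch(void)
	{
		if (cpu_has_rcu_callbacks())
			mod_timer(&rcu_idle_timer, jiffies + RCU_IDLE_GP_DELAY);
	}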

Except that everything that was building with CONFIG_RCU_FAST_NO_HZ
was also doing nohz_full on each and every CPU. Which meant that all
that CONFIG_RCU_FAST_NO_HZ was doing for them was adding an additional
useless check on each transition to and from idle. Which in turn is why
CONFIG_RCU_FAST_NO_HZ was removed. No one was using it in any way that
made any sense.

And more recent testing with rcu_nocbs on both ChromeOS and Android has
produced better savings than CONFIG_RCU_FAST_NO_HZ ever did anyway.
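
(For concreteness, that is the usual callback-offloading setup,
selected on the kernel command line along the lines of:

	rcu_nocbs=0-7

with CONFIG_RCU_NOCB_CPU=y, which moves callback invocation for the
listed CPUs onto the rcuo kthreads.)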

Much of the additional savings from Joel et al.'s work comes not so
much from reducing the number of ticks as from reducing the number of
grace periods, which are of course much heavier weight.
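
Assuming the call_rcu_lazy() API from Joel's series keeps call_rcu()'s
signature (as in the postings thus far), a caller looks the same and
only the urgency changes. An illustrative (made-up) caller:

	struct foo {
		struct rcu_head rcu;
		/* ... payload ... */
	};

	static void foo_free_rcu(struct rcu_head *rhp)
	{
		struct foo *fp = container_of(rhp, struct foo, rcu);

		kfree(fp);
	}

	static void foo_release(struct foo *fp)
	{
		/*
		 * Laziness means this callback may be batched with
		 * later ones rather than prodding a new grace period
		 * right away.
		 */
		call_rcu_lazy(&fp->rcu, foo_free_rcu);
	}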

And this of course means that any additional schemes to reduce RCU's
power consumption must be compared (with real measurements on real
hardware!) to Joel et al.'s work, whether in combination or as an
alternative. And either way, the power savings must of course justify
the added code and complexity.

Thanx, Paul