Re: RCU vs NOHZ

From: Paul E. McKenney
Date: Sat Sep 17 2022 - 10:28:44 EST


On Sat, Sep 17, 2022 at 09:52:49AM -0400, Joel Fernandes wrote:
> On 9/17/2022 9:35 AM, Peter Zijlstra wrote:
> > On Fri, Sep 16, 2022 at 02:11:10PM -0400, Joel Fernandes wrote:
> >> Hi Peter,
> >>
> >> On Fri, Sep 16, 2022 at 5:20 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >> [...]
> >>>> It wasn't enabled for ChromeOS.
> >>>>
> >>>> When fully enabled, it gave them the energy-efficiency advantages Joel
> >>>> described. And then Joel described some additional call_rcu_lazy()
> >>>> changes that provided even better energy efficiency. Though I believe
> >>>> that the application should also be changed to avoid incessantly opening
> >>>> and closing that file while the device is idle, as this would remove
> >>>> -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
> >>>> use cases would likely remain.
> >>>
> >>> So I'm thinking the scheme I outlined gets you most if not all of what
> >>> lazy would get you without having to add the lazy thing. A CPU is never
> >>> refused deep idle when it passes off the callbacks.
> >>>
> >>> The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term',
> >>> and for doing our utmost bestest to move work away from it. You *want*
> >>> to break affinity at this point.
> >>>
> >>> If you hate on the global, push it to a per-rcu_node offload list until
> >>> the whole node is idle, and then push it up to the next rcu_node level
> >>> until you reach the top.
> >>>
> >>> Then when the top rcu_node is fully idle, you can insta progress the QS
> >>> state and run the callbacks and go idle.
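
As a rough illustration of that pass-off idea, here is a user-space model
(names and structure are made up for the example; none of this is actual
kernel code or a kernel API):

#include <stdio.h>
#include <stdlib.h>

struct cb {
	void (*func)(struct cb *);
	struct cb *next;
};

struct node {
	struct node *parent;
	struct cb *offload;	/* callbacks handed off by idle children */
	int nr_children;
	int nr_idle;		/* children that have gone (NOHZ-)idle */
};

static void splice(struct cb **dst, struct cb **src)
{
	while (*dst)
		dst = &(*dst)->next;
	*dst = *src;
	*src = NULL;
}

static void run_all(struct cb **list)
{
	struct cb *c = *list;

	*list = NULL;
	while (c) {
		struct cb *next = c->next;

		c->func(c);
		c = next;
	}
}

/* A CPU (or a fully idle lower-level node) hands off its callback list. */
static void child_goes_idle(struct node *n, struct cb **cbs)
{
	splice(&n->offload, cbs);
	if (++n->nr_idle < n->nr_children)
		return;
	if (n->parent)		/* whole node idle: push the batch one level up */
		child_goes_idle(n->parent, &n->offload);
	else			/* top node fully idle: report QS, run callbacks */
		run_all(&n->offload);
}

static void demo_cb(struct cb *c)
{
	printf("callback %p ran once the whole tree went idle\n", (void *)c);
	free(c);
}

int main(void)
{
	struct node root  = { .nr_children = 2 };
	struct node leaf0 = { .parent = &root, .nr_children = 2 };
	struct node leaf1 = { .parent = &root, .nr_children = 2 };
	struct node *leaf_of[4] = { &leaf0, &leaf0, &leaf1, &leaf1 };
	struct cb *cpu_cbs[4] = { NULL };

	for (int cpu = 0; cpu < 4; cpu++) {
		struct cb *c = calloc(1, sizeof(*c));

		c->func = demo_cb;
		cpu_cbs[cpu] = c;
	}
	/* nothing runs until the last CPU under the root has gone idle */
	for (int cpu = 0; cpu < 4; cpu++)
		child_goes_idle(leaf_of[cpu], &cpu_cbs[cpu]);
	return 0;
}
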
> >>
> >> In my opinion the speed brakes have to be applied before the GP and
> >> other threads are even awakened. The issue Android and ChromeOS
> >> observe is that even a single CB queued every few jiffies can cause
> >> work that could otherwise be delayed/batched to be scheduled in. I am
> >> not sure whether your suggestions above address that. Do they?
> >
> > Scheduled how? Are these callbacks doing queue_work() or something?
>
> Way before the callback is even ready to execute, you can have rcuog, rcuop,
> and rcu_preempt threads running to go through the grace-period state machine.
>
> > Anyway; the thinking is that by passing off the callbacks on NOHZ, the
> > idle CPUs stay idle. By running the callbacks before going full idle,
> > all work is done and you can stay idle longer.
>
> But all CPUs being idle does not mean the grace period is over: you can have
> a task (at least on PREEMPT_RT) block in the middle of an RCU read-side
> critical section and then have all CPUs go idle.
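
For example, a reader like the one below (a sketch; only the RCU primitives
are real APIs, the rest is made up) can be preempted, or sleep on a
PREEMPT_RT spinlock_t, right in the middle of its critical section; the
grace period then cannot end until it reaches rcu_read_unlock(), no matter
how idle every CPU is:

#include <linux/printk.h>
#include <linux/rcupdate.h>

static int __rcu *shared;

static void reader(void)
{
	int *p;

	rcu_read_lock();		/* preemptible on PREEMPT_RT */
	p = rcu_dereference(shared);
	if (p)
		pr_info("val=%d\n", *p);	/* may be preempted here */
	rcu_read_unlock();		/* only now may the GP complete */
}
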
>
> Other than that, a typical flow could look like:
>
> 1. CPU queues a callback.
> 2. CPU then goes idle.
> 3. Another CPU runs the RCU threads, waking up otherwise-idle CPUs.
> 4. Grace period completes and an RCU thread runs a callback.
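
In code, that flow is roughly the following (a sketch; struct obj,
free_obj(), and retire_obj() are made-up names, while call_rcu(),
struct rcu_head, and kfree() are the real interfaces; the kthread names
assume a preemptible kernel with callback offloading via rcu_nocbs):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct obj {
	int data;
	struct rcu_head rh;
};

/* Step 4: invoked from an rcuop kthread (or softirq), but only after the
 * rcu_preempt GP kthread has driven a full grace period to completion. */
static void free_obj(struct rcu_head *rh)
{
	kfree(container_of(rh, struct obj, rh));
}

static void retire_obj(struct obj *old)
{
	/*
	 * Step 1: queue the callback.  On an offloaded (nocb) CPU this can
	 * already wake an rcuog kthread; this CPU then goes idle (step 2)
	 * while the grace-period machinery runs on some other CPU (step 3).
	 */
	call_rcu(&old->rh, free_obj);
}
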
>
> >> Try this experiment on your ADL system (for fun). Boot to the login
> >> screen on any distro,
> >
> > All my dev boxes are headless :-) I don't think the ADL even has X or
> > wayland installed.
>
> Ah, ok. Maybe what you have running (like daemons) is already requesting RCU
> for something. The Android folks had some logger requesting RCU all the time.
>
> >> and before logging in, run turbostat over ssh
> >> and observe the PC8 residency percentage. Now increase the
> >> jiffies_till_first_fqs boot parameter to 64 or so and try again.
> >> You may be surprised how much the PC8 percentage increases when you
> >> delay RCU and batch callbacks (via that jiffies boot option).
> >> Admittedly this is more amplified on ADL because of package C-states,
> >> firmware, and whatnot, and isn't as much of a problem on Android; but
> >> it still gives a nice power improvement there.
> >
> > I can try; but as of now turbostat doesn't seem to work on that thing at
> > all. I think localyesconfig might've stripped a required bit. I'll poke
> > at it later.
>
> Cool! I believe Len Brown can help with that, or maybe there is another way
> you can read the counters to figure out the PC8% and RAPL power.

Whatever the evaluation scheme, it absolutely -must- measure real power
consumed by real hardware running some real-world workload, and compare it
against Joel et al.'s scheme, or I will cheerfully ignore it. ;-)

Thanx, Paul