Re: RCU vs NOHZ
From: Joel Fernandes
Date: Fri Sep 23 2022 - 15:47:13 EST
Hi Paul,
On Fri, Sep 23, 2022 at 2:15 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Fri, Sep 23, 2022 at 01:47:40PM -0400, Joel Fernandes wrote:
>> On Fri, Sep 23, 2022 at 12:01 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>> [...]
>>>>> And here is an untested patch that in theory might allow much of the
>>>>> reduction in power with minimal complexity/overhead for kernels without
>>>>> rcu_nocbs CPUs. On the off-chance you know of someone who would be
>>>>> willing to do a realistic evaluation of it.
>>>>> Thanx, Paul
>>>>> ------------------------------------------------------------------------
>>>>> commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
>>>>> Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
>>>>> Date: Wed Sep 21 13:30:24 2022 -0700
>>>>> rcu: Let non-offloaded idle CPUs with callbacks defer tick
>>>>> When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
>>>>> not RCU needs the scheduler-clock tick to keep interrupting. Right now,
>>>>> RCU keeps the tick on for a given idle CPU if there are any non-offloaded
>>>>> callbacks queued on that CPU.
>>>>> But if all of these callbacks are waiting for a grace period to finish,
>>>>> there is no point in scheduling a tick before that grace period has any
>>>>> reasonable chance of completing. This commit therefore delays the tick
>>>>> in the case where all the callbacks are waiting for a specific grace
>>>>> period to elapse. In theory, this should result in a 50-70% reduction in
>>>>> RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.
>>>>> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
>>>>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>>>> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
>>>>> index 9bc025aa79a3..84e930c11065 100644
>>>>> --- a/include/linux/rcutiny.h
>>>>> +++ b/include/linux/rcutiny.h
>>>>> @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
>>>>> rcu_tasks_qs(current, (preempt)); \
>>>>> } while (0)
>>>>> -static inline int rcu_needs_cpu(void)
>>>>> +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>>>>> {
>>>>> return 0;
>>>>> }
>>>>> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
>>>>> index 70795386b9ff..3066e0975022 100644
>>>>> --- a/include/linux/rcutree.h
>>>>> +++ b/include/linux/rcutree.h
>>>>> @@ -19,7 +19,7 @@
>>>>> void rcu_softirq_qs(void);
>>>>> void rcu_note_context_switch(bool preempt);
>>>>> -int rcu_needs_cpu(void);
>>>>> +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
>>>>> void rcu_cpu_stall_reset(void);
>>>>> /*
>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>>> index 5ec97e3f7468..47cd3b0d2a07 100644
>>>>> --- a/kernel/rcu/tree.c
>>>>> +++ b/kernel/rcu/tree.c
>>>>> @@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
>>>>> * scheduler-clock interrupt.
>>>>> *
>>>>> * Just check whether or not this CPU has non-offloaded RCU callbacks
>>>>> - * queued.
>>>>> + * queued that need immediate attention.
>>>>> */
>>>>> -int rcu_needs_cpu(void)
>>>>> +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>>>>> {
>>>>> - return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
>>>>> - !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
>>>>> + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>>>>> + struct rcu_segcblist *rsclp = &rdp->cblist;
>>>>> +
>>>>> + // Disabled, empty, or offloaded means nothing to do.
>>>>> + if (!rcu_segcblist_is_enabled(rsclp) ||
>>>>> + rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
>>>>> + *nextevt = KTIME_MAX;
>>>>> + return 0;
>>>>> + }
>>>>> +
>>>>> + // Callbacks ready to invoke or that have not already been
>>>>> + // assigned a grace period need immediate attention.
>>>>> + if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
>>>>> + !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
>>>>> + return 1;> +
>>>>> + // There are callbacks waiting for some later grace period.
>>>>> + // Wait for about a grace period or two for the next tick, at which
>>>>> + // point there is high probability that this CPU will need to do some
>>>>> + // work for RCU.
>>>>> + *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) > + READ_ONCE(jiffies_till_next_fqs) + 1);
>>>> Looks like nice idea. Could this race with the main GP thread on another CPU
>>>> completing the grace period, then on this CPU there is actually some work to do
>>>> but rcu_needs_cpu() returns 0.
>>>> I think it is plausible but not common, in which case the extra delay is
>>>> probably Ok.
>>> Glad you like it!
>>> Yes, that race can happen, but it can also happen today.
>>> A scheduling-clock interrupt might arrive at a CPU just as a grace
>>> period finishes. Yes, the delay is longer with this patch. If this
>>> proves to be a problem, then the delay heuristic might expanded to
>>> include the age of the current grace period.
>>> But keeping it simple to start with.
>> Sure sounds good and yes I agree to the point of the existing issue
>> but the error is just 1 jiffie there as you pointed.
>
> One jiffy currently, but it would typically be about seven jiffies with
> the patch
Yes exactly, that’s what I meant.
> . Systems with smaller values of HZ would have fewer jiffies,
> and systems with more than 128 CPUs would have more jiffies. Systems
> booted with explicit values for the rcutree.jiffies_till_first_fqs and
> rcutree.jiffies_till_next_fqs kernel boot parameters could have whatever
> the administrator wanted. ;-)
Makes sense, thanks for clarifying.
> But the key point is that the grace period itself can be extended by
> that value just due to timing and distribution of idle CPUs.
>
>>>> Also, if the RCU readers take a long time, then we'd still wake up the system
>>>> periodically although with the above change, much fewer times, which is a good
>>>> thing.
>>> And the delay heuristic could also be expanded to include a digitally
>>> filtered estimate of grace-period duration. But again, keeping it simple
>>> to start with. ;-)
>>> My guess is that offloading gets you more power savings, but I don't
>>> have a good way of testing this guess.
>> I could try to run turbostat on Monday on our Intel SoCs, and see how
>> it reacts, but I was thinking of tracing this first to see the
>> behavior. Another thing I was thinking of was updating (the future)
>> rcutop to see how many 'idle ticks' are RCU related, vs others; and
>> then see how this patch effects that.
>
> Such testing would be very welcome, thank you!
>
> This patch might also need to keep track of the last tick on a given
> CPU in order to prevent frequent short idle periods from indefinitely
> delaying the tick.
I know what you mean! I had the exact same issue with the lazy timer initially, now the behavior is any lazy enqueue which happened will piggy back onto the existing timer already running.
- Joel
>
>
> Thanx, Paul
>
>> thanks,
>> - Joel
>>>>> unsigned long basejiff;
>>>>> unsigned int seq;
>>>>> @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>>>>> * minimal delta which brings us back to this place
>>>>> * immediately. Lather, rinse and repeat...
>>>>> */
>>>>> - if (rcu_needs_cpu() || arch_needs_cpu() ||
>>>>> + if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
>>>>> irq_work_needs_cpu() || local_timer_softirq_pending()) {
>>>>> next_tick = basemono + TICK_NSEC;
>>>>> } else {
>>>>> @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>>>>> * disabled this also looks at the next expiring
>>>>> * hrtimer.
>>>>> */
>>>>> - next_tick = get_next_timer_interrupt(basejiff, basemono);
>>>>> - ts->next_timer = next_tick;
>>>>> + next_tmr = get_next_timer_interrupt(basejiff, basemono);
>>>>> + ts->next_timer = next_tmr;
>>>>> + /* Take the next rcu event into account */
>>>>> + next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
>>>>> }
>>>>> /*