Re: dyntick-idle CPU and node's qsmask

From: Paul E. McKenney
Date: Sun Nov 11 2018 - 13:36:26 EST


On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > Hi Paul and everyone,
> > > > >
> > > > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > > > behalf.
> > > > >
> > > > > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 792 0 dti
> > > > > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 801 2 dti
> > > > > rcu_sched-10 [003] 40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > >
> > > > > That's all good but I was wondering if we can do better for the idle CPUs if
> > > > > we can somehow not set the qsmask of the node in the first place. Then no
> > > > > reporting of a quiescent state would be needed for idle CPUs, right?
> > > > > And we would also not need to acquire the rnp lock I think.
> > > > >
> > > > > At least for a single node tree RCU system, it seems that would avoid needing
> > > > > to acquire the lock without complications. Anyway let me know your thoughts
> > > > > and happy to discuss this at the hallways of the LPC as well for folks
> > > > > attending :)
> > > >
> > > > We could, but that would require consulting the rcu_data structure for
> > > > each CPU while initializing the grace period, thus increasing the number
> > > > of cache misses during grace-period initialization and also shortly after
> > > > for any non-idle CPUs. This seems backwards on busy systems where each
> > >
> > > When I traced, it appears to me that rcu_data structure of a remote CPU was
> > > being consulted anyway by the rcu_sched thread. So it seems like such cache
> > > miss would happen anyway whether it is during grace-period initialization or
> > > during the fqs stage? I guess I'm trying to say, the consultation of remote
> > > CPU's rcu_data happens anyway.
> >
> > Hmmm...
> >
> > The rcu_gp_init() function does access an rcu_data structure, but it is
> > that of the current CPU, so shouldn't involve a communications cache miss,
> > at least not in the common case.
> >
> > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > functions that it calls? In that case, please see below.
>
> Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
>
> > > > CPU will with high probability report its own quiescent state before three
> > > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > > would be wasted motion.
> > >
> > > If all the CPUs are busy and reporting their QS themselves, then I think the
> > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > force_qs_rnp) wouldn't be called and so there would be no cache misses on
> > > rcu_data right?
> >
> > Yes, but assuming that all CPUs report their quiescent states before
> > the first call to rcu_gp_fqs(). One exception is when some CPU is
> > looping in the kernel for many milliseconds without passing through a
> > quiescent state. This is because for recent kernels, cond_resched()
> > is not a quiescent state until the grace period is something like 100
> > milliseconds old. (For older kernels, cond_resched() was never an RCU
> > quiescent state unless it actually scheduled.)
> >
> > Why wait 100 milliseconds? Because otherwise the increase in
> > cond_resched() overhead shows up all too well, causing the 0day test robot
> > to complain bitterly. Besides, I would expect that in the common case,
> > CPUs would be executing usermode code.
>
> Makes sense. I was also wondering about this other thing you mentioned about
> waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> that mean that even if a single CPU is dyntick-idle for a long period of
> time, then the minimum grace period duration would be at least 3 jiffies? In
> our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> consumption low. Not that I'm saying it's an issue or anything (since IIUC if
> someone wants shorter grace periods, they should just use expedited GPs), but
> it sounds like it would be a shorter GP if we just set the qsmask early on
> somehow and we can manage the overhead of doing so.

First, there is some autotuning of the delay based on HZ:

#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))

So at HZ=300, you should be seeing a two-jiffy delay rather than the
usual HZ=1000 three-jiffy delay. Of course, this means that the delay
is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
rates often mean slower instruction execution and thus a desire for
lower RCU overhead. There is further autotuning based on number of
CPUs, but this does not kick in until you have 256 CPUs on your system,
and I bet that smartphones aren't there yet. Nevertheless, check out
RCU_JIFFIES_FQS_DIV for more info on this.
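
To make the arithmetic concrete, here is a minimal userspace sketch of the
autotuning. It is not the kernel's code: HZ is passed as a parameter so that
several values can be compared, the helper names are hypothetical, and both
the value 256 for RCU_JIFFIES_FQS_DIV and the "add nr_cpus / 256" form are
assumptions taken from the 256-CPU remark above.

```c
/* Assumed divisor, based on the "256 CPUs" remark; check the kernel source. */
#define RCU_JIFFIES_FQS_DIV 256

/* Same expression as RCU_JIFFIES_TILL_FORCE_QS, with HZ as an argument. */
static inline unsigned long base_fqs_jiffies(unsigned long hz)
{
	return 1 + (hz > 250) + (hz > 500);
}

/* Hypothetical helper: assumed CPU-count adjustment, one jiffy per 256 CPUs. */
static inline unsigned long fqs_delay_jiffies(unsigned long hz,
					      unsigned long nr_cpus)
{
	return base_fqs_jiffies(hz) + nr_cpus / RCU_JIFFIES_FQS_DIV;
}

/* Hypothetical helper: the same delay in microseconds (integer division). */
static inline unsigned long fqs_delay_us(unsigned long hz,
					 unsigned long nr_cpus)
{
	return fqs_delay_jiffies(hz, nr_cpus) * 1000000UL / hz;
}
```

Under these assumptions, an 8-CPU system gets two jiffies (about 6.67ms) at
HZ=300 and three jiffies (3ms) at HZ=1000, matching the numbers above.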

But you can always override this autotuning using the following kernel
boot parameters:

rcutree.jiffies_till_first_fqs
rcutree.jiffies_till_next_fqs

You can even set the first one to zero if you want the effect of pre-scanning
for idle CPUs. ;-)

The second must be set to one or greater.

Both are capped at one second (HZ).
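
The limits just described can be sketched as a pair of clamping helpers.
This is only an illustration: the function names are hypothetical, HZ is
passed in as an argument, and treating an out-of-range value as silently
clamped (rather than rejected) is an assumption.

```c
/* Hypothetical helper: first-scan delay may be zero, capped at HZ jiffies. */
static inline unsigned long clamp_first_fqs(unsigned long j, unsigned long hz)
{
	return j > hz ? hz : j;
}

/* Hypothetical helper: later-scan delay is at least one jiffy, capped at HZ. */
static inline unsigned long clamp_next_fqs(unsigned long j, unsigned long hz)
{
	if (j > hz)
		return hz;
	return j ? j : 1;
}
```

For example, booting with "rcutree.jiffies_till_first_fqs=0
rcutree.jiffies_till_next_fqs=1" on the kernel command line would request
an immediate first scan followed by a rescan every jiffy.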

> > Ah, did you build with NO_HZ_FULL, boot with nohz_full CPUs, and then run
> > CPU-bound usermode workloads on those CPUs? Such CPUs would appear to
> > be idle from an RCU perspective. But these CPUs would never touch their
> > rcu_data structures, so they would likely remain in the RCU grace-period
> > kthread's cache. So this should work well also. Give or take that other
> > work would likely eject them from the cache, but in that case they would
> > be capacity cache misses rather than the aforementioned communications
> > cache misses. Not that this distinction matters to whoever is measuring
> > performance. ;-)
>
> Ah ok :-) I had booted with !CONFIG_NO_HZ_FULL for this test.

Never mind, then. ;-)

> > > Anyway it was just an idea that popped up when I was going through traces :)
> > > Thanks for the discussion and happy to discuss further or try out anything.
> >
> > Either way, I do appreciate your going through this. People have found
> > RCU bugs this way, one of which involved RCU uselessly calling a particular
> > function twice in quick succession. ;-)
>
> Thanks. It is my pleasure and happy to help :) I'll keep digging into it.

Looking forward to further questions and patches. ;-)

Thanx, Paul