Re: dyntick-idle CPU and node's qsmask

From: Joel Fernandes
Date: Sun Nov 11 2018 - 13:09:23 EST


On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > Hi Paul and everyone,
> > > >
> > > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > > behalf.
> > > >
> > > > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 792 0 dti
> > > > rcu_sched-10 [003] 40.008039: rcu_fqs: rcu_sched 801 2 dti
> > > > rcu_sched-10 [003] 40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > >
> > > > That's all good, but I was wondering if we can do better for the idle CPUs if
> > > > we can somehow not set the qsmask of the node in the first place. Then no
> > > > reporting of a quiescent state would be needed for idle CPUs, right?
> > > > And we would also not need to acquire the rnp lock, I think.
> > > >
> > > > At least for a single node tree RCU system, it seems that would avoid needing
> > > > to acquire the lock without complications. Anyway, let me know your thoughts;
> > > > I'm happy to discuss this in the hallways at LPC as well for folks
> > > > attending :)
> > >
> > > We could, but that would require consulting the rcu_data structure for
> > > each CPU while initializing the grace period, thus increasing the number
> > > of cache misses during grace-period initialization and also shortly after
> > > for any non-idle CPUs. This seems backwards on busy systems where each
> >
> > When I traced, it appeared to me that the rcu_data structure of a remote CPU was
> > being consulted anyway by the rcu_sched thread. So it seems like such a cache
> > miss would happen anyway, whether during grace-period initialization or
> > during the fqs stage? I guess I'm trying to say that the consultation of remote
> > CPUs' rcu_data happens anyway.
>
> Hmmm...
>
> The rcu_gp_init() function does access an rcu_data structure, but it is
> that of the current CPU, so shouldn't involve a communications cache miss,
> at least not in the common case.
>
> Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> functions that it calls? In that case, please see below.

Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
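To make the qsmask discussion concrete, here is a toy user-space model of the behavior described in this thread. It is only a sketch, not kernel code: the names toy_node, toy_cpu, toy_gp_init(), toy_report_qs() and toy_fqs_scan() are all hypothetical, and the single flat node and the boolean idle flag are simplifying assumptions. It shows that the scan only consults per-CPU state for CPUs whose qsmask bit is still set, so CPUs that already reported their own QS cost nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 8

/* Hypothetical stand-in for a leaf rcu_node: one bit per CPU still owing a QS. */
struct toy_node {
	uint32_t qsmask;
};

/* Hypothetical stand-in for per-CPU state (assumption: a plain idle flag). */
struct toy_cpu {
	bool dyntick_idle;
};

/* Grace-period start: every online CPU owes a quiescent state. */
static void toy_gp_init(struct toy_node *rnp, uint32_t online_mask)
{
	rnp->qsmask = online_mask;
}

/* A busy CPU reports its own quiescent state by clearing its bit. */
static void toy_report_qs(struct toy_node *rnp, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	rnp->qsmask &= ~(UINT32_C(1) << cpu);
}

/*
 * Force-quiescent-state scan: look only at CPUs whose bit is still set,
 * and clear the bit on behalf of those that are idle.
 * Returns how many per-CPU structures had to be consulted.
 */
static int toy_fqs_scan(struct toy_node *rnp, const struct toy_cpu *cpus)
{
	int consulted = 0;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		uint32_t bit = UINT32_C(1) << cpu;

		if (!(rnp->qsmask & bit))
			continue;	/* already reported: no remote access */
		consulted++;		/* models the remote per-CPU access */
		if (cpus[cpu].dyntick_idle)
			rnp->qsmask &= ~bit;
	}
	return consulted;
}
```

With four CPUs online, CPUs 1 and 3 busy and self-reporting, and CPUs 0 and 2 idle (as in the dti trace lines above), the scan in this model consults only two per-CPU structures and leaves qsmask at zero.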

> > > CPU will with high probability report its own quiescent state before three
> > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > would be wasted motion.
> >
> > If all the CPUs are busy and reporting their QS themselves, then I think the
> > qsmask is likely 0, so rcu_implicit_dynticks_qs (called from
> > force_qs_rnp) wouldn't be called and there would be no cache misses on
> > rcu_data, right?
>
> Yes, but assuming that all CPUs report their quiescent states before
> the first call to rcu_gp_fqs(). One exception is when some CPU is
> looping in the kernel for many milliseconds without passing through a
> quiescent state. This is because for recent kernels, cond_resched()
> is not a quiescent state until the grace period is something like 100
> milliseconds old. (For older kernels, cond_resched() was never an RCU
> quiescent state unless it actually scheduled.)
>
> Why wait 100 milliseconds? Because otherwise the increase in
> cond_resched() overhead shows up all too well, causing the 0day test robot
> to complain bitterly. Besides, I would expect that in the common case,
> CPUs would be executing usermode code.

Makes sense. I was also wondering about the other thing you mentioned, about
waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
that mean that if even a single CPU is dyntick-idle for a long period of
time, the minimum grace period duration would be at least 3 jiffies? On
our mobile embedded devices, a jiffy is 3.33ms (HZ=300) to keep power
consumption low. Not that I'm saying it's an issue or anything (since IIUC if
someone wants shorter grace periods, they should just use expedited GPs), but
it sounds like GPs would be shorter if we just set the qsmask early on
somehow and could manage the overhead of doing so.
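For reference, the arithmetic behind those numbers can be sketched as below. The helper toy_jiffies_to_usecs() is a hypothetical name (not a kernel API), and the three-jiffy wait is simply the figure quoted in this thread; the actual delay may depend on kernel configuration.

```c
#include <assert.h>

/*
 * Length of n jiffies in microseconds for a given tick rate (HZ).
 * Hypothetical helper for illustration; integer division truncates.
 */
static long toy_jiffies_to_usecs(long n, long hz)
{
	assert(hz > 0);
	return n * 1000000L / hz;
}
```

Under these assumptions, toy_jiffies_to_usecs(3, 300) gives 10000, so with HZ=300 a three-jiffy wait puts a floor of about 10ms on a grace period that depends on an idle CPU.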

> Ah, did you build with NO_HZ_FULL, boot with nohz_full CPUs, and then run
> CPU-bound usermode workloads on those CPUs? Such CPUs would appear to
> be idle from an RCU perspective. But these CPUs would never touch their
> rcu_data structures, so they would likely remain in the RCU grace-period
> kthread's cache. So this should work well also. Give or take that other
> work would likely eject them from the cache, but in that case they would
> be capacity cache misses rather than the aforementioned communications
> cache misses. Not that this distinction matters to whoever is measuring
> performance. ;-)

Ah ok :-) I had booted with !CONFIG_NO_HZ_FULL for this test.

> > Anyway it was just an idea that popped up when I was going through traces :)
> > Thanks for the discussion and happy to discuss further or try out anything.
>
> Either way, I do appreciate your going through this. People have found
> RCU bugs this way, one of which involved RCU uselessly calling a particular
> function twice in quick succession. ;-)

Thanks. It's my pleasure, and I'm happy to help :) I'll keep digging into it.

- Joel