Re: [RFC] dynticks: dynticks_idle is only modified locally use this_cpu ops
From: Paul E. McKenney
Date: Wed Sep 03 2014 - 10:44:03 EST
On Wed, Sep 03, 2014 at 09:10:24AM -0500, Christoph Lameter wrote:
> On Tue, 2 Sep 2014, Paul E. McKenney wrote:
>
> > On Tue, Sep 02, 2014 at 06:22:52PM -0500, Christoph Lameter wrote:
> > > On Tue, 2 Sep 2014, Paul E. McKenney wrote:
> > >
> > > > Yep, these two have been on my "when I am feeling insanely gutsy" list
> > > > for quite some time.
> > > >
> > > > But I have to ask... On x86, is a pair of mfence instructions really
> > > > cheaper than an atomic increment?
> > >
> > > Not sure why you would need an mfence instruction?
> >
> > Because otherwise RCU can break. As soon as the grace-period machinery
> > sees that the value of this variable is even, it assumes a quiescent
> > state. If there are no memory barriers, the non-quiescent code might
> > not have completed executing, and your kernel's actuarial statistics
> > become sub-optimal.
>
> Synchronization using per cpu variables is bound to be problematic since
> they are simply not made for that. The per cpu variable usually can change
> without notice to the other cpu since typically per cpu processing is
> ongoing. The improved performance of per cpu instructions is
> possible only because we exploit the fact that there is no need for
> synchronization.
Christoph, per-CPU variables are memory. If I do the correct operations
on them, they work just like any other memory. And yes, I typically
cannot use the per-CPU operations if I need coherent results visible to
all CPUs (but see your statistical-counters example below). This is of
course exactly why I use atomic operations and memory barriers on
the dynticks counters.
You would prefer that I instead allocated an NR_CPUS-sized array?
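If it helps to see the pattern spelled out, here is a minimal sketch with
invented names (this is not the in-tree rcu_dynticks code) of a per-CPU
atomic_t that the owning CPU updates with a real atomic preceded by a
barrier; any other CPU can reach the same memory through per_cpu_ptr():

#include <linux/atomic.h>
#include <linux/percpu.h>

/* Illustrative counter: even while "idle", odd otherwise. */
static DEFINE_PER_CPU(atomic_t, example_dynticks);

/* Runs on the local CPU when entering the idle-like state. */
static void example_enter_idle(void)
{
        smp_mb__before_atomic();        /* Order pre-idle accesses before the increment. */
        atomic_inc(this_cpu_ptr(&example_dynticks));   /* Odd -> even. */
}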
> Kernel statistics *are* suboptimal for that very reason because they
> typically sum up individual counters from multiple processors without
> regard to complete accuracy. The manipulation of the VM counters is very
> low overhead due to the lack of concern for synchronization. This is a
> tradeoff vs. performance. We actually can tune the fuzziness of
> statistics in the VM which allows us to control the overhead generated by
> the need for more or less accurate statistics.
Yes, statistics is one of the cross-CPU uses of per-CPU variables where
you can get away without tight synchronization.
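The statistical-counter pattern being referred to looks roughly like this
(names invented for the example; the real VM counters are more elaborate
and tunable):

#include <linux/percpu.h>
#include <linux/cpumask.h>

static DEFINE_PER_CPU(unsigned long, example_events);

static inline void example_count_event(void)
{
        this_cpu_inc(example_events);   /* Cheap: no locks, no barriers. */
}

static unsigned long example_events_sum(void)
{
        unsigned long sum = 0;
        int cpu;

        /* Racy by design: the total is only approximately accurate. */
        for_each_possible_cpu(cpu)
                sum += per_cpu(example_events, cpu);
        return sum;
}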
> Memory barriers ensure that the code has completed executing? I think what
> is meant is that they ensure that all modifications to cachelines before the
> change of state are visible and the other processor does not have stale
> cachelines around?
No, memory barriers simply enforce ordering, and ordering is all
that RCU's dyntick-idle code relies on. In other words, if the RCU
grace-period kthread sees a value indicating that a CPU is idle, that
kthread needs assurance that all the memory operations that took place
before that CPU went idle have completed. All that is required for this
is ordering: I saw that the ->dynticks counter had an even value, and
I know that it was preceded by a memory barrier; therefore, all code after
a subsequent memory barrier will see the effects of all pre-idle code.
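On the grace-period side, that argument corresponds to a check along these
lines, reusing the illustrative example_dynticks counter from the sketch
above (again, not the actual RCU implementation):

/* Sample a remote CPU's counter; atomic_add_return() implies full barriers. */
static bool example_cpu_seen_idle(int cpu)
{
        int snap = atomic_add_return(0, per_cpu_ptr(&example_dynticks, cpu));

        return (snap & 0x1) == 0;       /* Even => idle; pre-idle accesses are ordered before us. */
}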
> If the state variable is odd how does the other processor see a state
> change to even before processing is complete if the state is updated only
> at the end of processing?
I am having some difficulty parsing that question.
However, suppose that the grace-period kthread sees ->dynticks with an
odd value, say three. Suppose that this kthread later sees another
odd value, say eleven. Then because of the atomic operations and memory
barriers, the kthread can be sure that the CPU corresponding to that
->dynticks has passed through an idle state since the time it saw the
value three, and therefore that that CPU has passed through a quiescent
state since the start of the grace period.
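That two-snapshot argument can be written down directly; the helper below
uses made-up names rather than the in-tree ones:

/*
 * True if the CPU passed through an idle (quiescent) state between the
 * times the two snapshots were taken.
 */
static bool example_quiescent_since(int old_snap, int new_snap)
{
        return (old_snap & 0x1) == 0 ||         /* Idle at the first sample. */
               (new_snap & 0x1) == 0 ||         /* Idle at the second sample. */
               old_snap != new_snap;            /* 3 -> ... -> 11 must cross an even value. */
}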
Similarly, if the grace-period kthread sees ->dynticks with an even value,
it knows that after a subsequent memory barrier, all the pre-idle effects
will be visible, as required.
As a diagram:
CPU 0                                           CPU 1

/* Access some RCU-protected struct. */
smp_mb__before_atomic()
atomic_inc() -> even value
/* Now idle. */
                                                atomic_add_return(0, ...) -> even value
                                                /* implied memory barrier. */
                                                /* post-GP changes won't interfere */
                                                /* with pre-idle accesses. */
In other words, if CPU 0 accessed some RCU-protected memory before going
idle, that memory is guaranteed not to be freed until after those pre-idle
accesses have completed.
Of course, it also needs to work the other way around:
CPU 0                                           CPU 1

/* Remove some RCU-protected struct. */
/* implied memory barrier. */
atomic_add_return(0, ...) -> even value
                                                /* Was idle. */
                                                atomic_inc() -> odd value
                                                smp_mb__after_atomic()
                                                /* post-idle accesses. */
                                                /* Will see pre-GP changes. */
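The exit-from-idle side of that second diagram would look something like
this, again with the illustrative counter from above:

/* Runs on the local CPU when leaving the idle-like state. */
static void example_exit_idle(void)
{
        atomic_inc(this_cpu_ptr(&example_dynticks));   /* Even -> odd. */
        smp_mb__after_atomic();         /* Order the increment before post-idle accesses. */
}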
Thanx, Paul