Re: Internal vs. external barriers (was: Re: Interesting LKMM litmus test)

From: Alan Stern
Date: Sun Jan 22 2023 - 15:32:30 EST


On Fri, Jan 20, 2023 at 01:20:37PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 20, 2023 at 03:36:24PM -0500, Alan Stern wrote:
> > On Fri, Jan 20, 2023 at 11:20:32AM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 20, 2023 at 01:37:51PM -0500, Alan Stern wrote:
> > > > srcu_read_unlock() does not need a full smp_mb().
> > >
> > > That is quite possible, and that is what we are looking into. And testing
> > > thus far agrees with you. But the grace-period ordering constraints
> > > are quite severe, so this requires careful checking and severe testing.
> >
> > If you're interested, I can provide a simple argument to show that the
> > Fundamental Law of RCU would continue to hold with only a release fence.
> > There is an added requirement: merely that synchronize_srcu() must have
> > an smp_mb() somewhere after its final read of the unlock counters --
> > which your version of the algorithm already has.
>
> Please!
>
> For your amusement, here is a very informal argument that this is
> the case:
>
> https://docs.google.com/document/d/1xvwQzavmH474MBPAIBqVyvCrCcS5j2BpqhErPhRj7Is/edit?usp=sharing
>
> See the "Read-Side Optimizations" section at the end.

It looks like you've got the basic idea. Most of the complications seem
to arise from the different ways a grace period can happen.

Here's what I was thinking. Let C be a read-side critical section, with
L being its invocation of srcu_down_read() and U being the matching
invocation of srcu_up_read(). Let idx be the index value read by L (and
used by U). I will assume that L has the form:

idx = READ_ONCE(ss->index);
temp = this_cpu(ss->lock)[idx];
WRITE_ONCE(this_cpu(ss->lock)[idx], temp + 1)
smp_mb();

(or whatever is the right syntax for incrementing a per-cpu array
element). Likewise, assume U has the form:

temp = this_cpu(ss->unlock)[idx];
smp_store_release(&this_cpu(ss->unlock)[idx], temp + 1);

Let G be any SRCU grace period -- an invocation of synchronize_srcu(ss).
Assume G has the overall form:

accumulate_and_compare_loop(!ss->index);
smp_mb();
WRITE_ONCE(ss->index, !ss->index);
smp_mb();
accumulate_and_compare_loop(!ss->index);

where accumulate_and_compare_loop(i) has the form:

do {
s = t = 0;
for each CPU c:
s += READ_ONCE(cpu(c, ss->unlock)[i]);
smp_mb();
for each CPU c:
t += READ_ONCE(cpu(c, ss->lock)[i]);
} while (s != t);

It's not too hard to show, and I trust you already believe, that in the
final iteration of the accumulate_and_compare_loop(i) call for which
i = idx, the lock-counter increment in L is observed if and only if the
unlock-counter increment in U is observed. Thus we have two cases:

Case 1: Both of the increments are observed. Since the increment in U
is a store-release, every write that propagated to U's CPU before the
increment is therefore visible to G's CPU before its last read of an
unlock counter. Since the full fence in accumulate_and_compare_loop()
is executed after the last such read, these writes must propagate to
every CPU before G ends.

Case 2: Neither of the increments is observed. Let W be any write which
propagated to G's CPU before G started. Does W propagate to C before L
ends? We have the following SB or RWC pattern:

G C
------------------------ -----------------------
W propagates to G's CPU L writes lock counter
G does smp_mb() L does smp_mb()
G reads L's lock counter W propagates to C's CPU

(The smp_mb() in the left column is the one in
accumulate_and_compare_loop(idx), which precedes the reads of the lock
counters.)

If L's smp_mb() ended before G's did then L's write to the lock counter
would have propagated to G's CPU before G's smp_mb() ended, and hence G
would have observed the lock-counter increment. Since this didn't
happen, we know that G's smp_mb() ends before L's does. This means that
W must propagate to every CPU before L terminates, and hence before C's
critical section starts.

Together, these two cases cover the requirements of the Fundamental Law
of RCU. The memory barrier in U was needed only in Case 1, and there it
only needed to be a release fence.

Alan