Re: [PATCH] Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()

From: Paul E. McKenney
Date: Sat Apr 20 2019 - 04:54:55 EST


On Sat, Apr 20, 2019 at 10:26:15AM +1000, Nicholas Piggin wrote:
> Paul E. McKenney's on April 20, 2019 4:26 am:
> > On Fri, Apr 19, 2019 at 08:00:17PM +0200, Peter Zijlstra wrote:
> >> On Fri, Apr 19, 2019 at 01:21:45PM -0400, Alan Stern wrote:
> >> > Index: usb-devel/Documentation/atomic_t.txt
> >> > ===================================================================
> >> > --- usb-devel.orig/Documentation/atomic_t.txt
> >> > +++ usb-devel/Documentation/atomic_t.txt
> >> > @@ -171,7 +171,10 @@ The barriers:
> >> > smp_mb__{before,after}_atomic()
> >> >
> >> > only apply to the RMW ops and can be used to augment/upgrade the ordering
> >> > -inherent to the used atomic op. These barriers provide a full smp_mb().
> >> > +inherent to the used atomic op. Unlike normal smp_mb() barriers, they order
> >> > +only the RMW op itself against the instructions preceding the
> >> > +smp_mb__before_atomic() or following the smp_mb__after_atomic(); they do
> >> > +not order instructions on the other side of the RMW op at all.
> >>
> >> Now it is I who is confused; what?
> >>
> >> x = 1;
> >> smp_mb__before_atomic();
> >> atomic_add(1, &a);
> >> y = 1;
> >>
> >> the stores to both x and y will be ordered as if an smp_mb() where
> >> there. There is no order between a and y otoh.
> >
> > Let's look at x86. And a slightly different example:
> >
> > x = 1;
> > smp_mb__before_atomic();
> > atomic_add(1, &a);
> > r1 = y;
> >
> > The atomic_add() asm does not have the "memory" constraint, which is
> > completely legitimate because atomic_add() does not return a value,
> > and thus guarantees no ordering. The compiler is therefore within
> > its rights to transform the code into the following:
> >
> > x = 1;
> > smp_mb__before_atomic();
> > r1 = y;
> > atomic_add(1, &a);
> >
> > But x86's smp_mb__before_atomic() is just a compiler barrier, and
> > x86 is further allowed to reorder prior stores with later loads.
> > The CPU can therefore execute this code as follows:
> >
> > r1 = y;
> > x = 1;
> > smp_mb__before_atomic();
> > atomic_add(1, &a);
> >
> > So in general, the ordering is guaranteed only to the atomic itself,
> > not to accesses on the other side of the atomic.
>
> That's interesting. I don't think that's what all our code expects.
> I had the same idea as Peter.
>
> IIRC the primitive was originally introduced exactly so x86 would not
> need to have the unnecessary hardware barrier with sequences like
>
> smp_mb();
> ...
> atomic_inc(&v);
>
> The "new" semantics are a bit subtle. One option might be just to
> replace it entirely with atomic_xxx_mb() ?

Hmmm... There are more than 2,000 uses of atomic_inc() in the kernel.
There are about 300-400 total between smp_mb__before_atomic() and
smp_mb__after_atomic().

So what are our options?

1. atomic_xxx_mb() as you say.

From a quick scan of smp_mb__before_atomic() uses, we need this
for atomic_inc(), atomic_dec(), atomic_add(), atomic_sub(),
clear_bit(), set_bit(), test_bit(), atomic_long_dec(),
atomic_long_add(), refcount_dec(), cmpxchg_relaxed(),
set_tsk_thread_flag(), clear_bit_unlock().

Another random look identifies atomic_andnot().

And atomic_set(): set_preempt_state(). This fails
on x86, s390, and TSO friends, does it not? Or is
this ARM-only? Still, why not just smp_mb() before and
after? Same issue in __kernfs_new_node(), bio_cnt_set(),
sbitmap_queue_update_wake_batch(),

Ditto for atomic64_set() in __ceph_dir_set_complete().

Ditto for atomic_read() in rvt_qp_is_avail(). This function
has a couple of other oddly placed smp_mb__before_atomic().

And atomic_cmpxchg(): msc_buffer_alloc(). This instance
of smp_mb__before_atomic() can be removed unless I am missing
something subtle. Ditto for kvm_vcpu_exiting_guest_mode(),
pv_kick_node(), __sbq_wake_up(),

And lock acquisition??? acm_read_bulk_callback().

In nfnl_acct_fill_info(), a smp_mb__before_atomic() after
a atomic64_xchg()??? Also before a clear_bit(), but the
clear_bit() is inside an "if".

There are a few cases that would see added overhead. For example,
svc_get_next_xprt() has the following:

smp_mb__before_atomic();
clear_bit(SP_CONGESTED, &pool->sp_flags);
clear_bit(RQ_BUSY, &rqstp->rq_flags);
smp_mb__after_atomic();

And xs_sock_reset_connection_flags() has this:

smp_mb__before_atomic();
clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
clear_bit(XPRT_CLOSING, &xprt->state);
xs_sock_reset_state_flags(xprt); /* Also a clear_bit(). */
smp_mb__after_atomic();

Yeah, there are more than a few misuses, aren't there? :-/
A coccinelle script seems in order. In 0day test robot.

But there are a number of helper functions whose purpose
seems to be to wrap an atomic in smp_mb__before_atomic() and
smp_mb__after_atomic(), so some of the atomic_xxx_mb() functions
might be a good idea just for improved readability.

2. Add something to checkpatch.pl warning about non-total ordering,
with the error message explaining that the ordering is partial.

3. Make non-value-returning atomics provide full ordering.
This would of course need some benchmarking, but would be a
simple change to make and would eliminate a large class of
potential bugs. My guess is that the loss in performance
would be non-negligible, but who knows?

4. Your idea here!

Thanx, Paul