Re: [PATCH] blk-mq: Start to fix memory ordering...

From: Peter Zijlstra
Date: Wed Sep 06 2017 - 04:49:12 EST


On Wed, Sep 06, 2017 at 09:13:04AM +0200, Andrea Parri wrote:
> > + smp_mb__before_atomic();
>
> I am wondering whether we should be using smp_wmb() instead: this would
> provide the above guarantee and save a full barrier on powerpc/arm64.

Right, did that.

> > + set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> > + if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> > + /*
> > + * Coherence order guarantees these consequtive stores to a
> > + * singe variable propagate in the specified order. Thus the
> > + * clear_bit() is ordered _after_ the set bit. See
> > + * blk_mq_check_expired().
> > + */
> > clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
>
> It could be useful to stress that set_bit(), clear_bit() must "act" on
> the same subword of the unsigned long (whatever "subword" means at this
> level...) to rely on the coherence order (c.f., alpha's implementation).

As I wrote to your initial reply (which I now saw was private) I went
through the architectures again and found h8300 to use byte ops to
implement all the bitops.

So subword here means byte :/

The last time we looked at this was for PG_waiter and back then I think
we settled on u32 (with Alpha for example using 32bit ll/sc ops). Linus
moved PG_waiters to the same byte though, so that is all fine.

b91e1302ad9b ("mm: optimize PageWaiters bit use for unlock_page()")

> > + if (time_after_eq(jiffies, deadline)) {
> > + if (!blk_mark_rq_complete(rq)) {
> > + /*
> > + * Relies on the implied MB from test_and_clear() to
> > + * order the COMPLETE load against the STARTED load.
> > + * Orders against the coherence order in
> > + * blk_mq_start_request().
>
> I understand "from test_and_set_bit()" (in blk_mark_rq_complete()) and
> that the interested cycle is:
>
> /* in blk_mq_start_request() */
> [STORE STARTED bit = 1 into atomic_flags]
> -->co [STORE COMPLETE bit = 0 into atomic_flags]
> /* in blk_mq_check_expired() */
> -->rf [LOAD COMPLETE bit = 0 from atomic_flags]
> -->po-loc [LOAD STARTED bit = 0 from atomic_flags]
> /* in blk_mq_start_request() again */
> -->fr [STORE STARTED bit = 1 into atomic_flags]
>
> (N.B. Assume all accesses happen to/from the same subword.)
>
> This cycle being forbidden by the "coherence check", I'd say we do not
> need to rely on the MB mentioned by the comment; what am I missing?

Nothing, I forgot about the read-after-read thing but did spot the MB.
Either one suffices to guarantee the order we need. It just needs to be
documented as being relied upon.