Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline

From: Ming Lei
Date: Thu May 28 2020 - 01:19:55 EST


On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> On 2020-05-27 18:46, Ming Lei wrote:
> > On Wed, May 27, 2020 at 04:09:19PM -0700, Bart Van Assche wrote:
> >> On 2020-05-27 11:06, Christoph Hellwig wrote:
> >>> --- a/block/blk-mq-tag.c
> >>> +++ b/block/blk-mq-tag.c
> >>> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
> >>> sbitmap_finish_wait(bt, ws, &wait);
> >>>
> >>> found_tag:
> >>> + /*
> >>> + * Give up this allocation if the hctx is inactive. The caller will
> >>> + * retry on an active hctx.
> >>> + */
> >>> + if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> >>> + blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> >>> + return -1;
> >>> + }
> >>> return tag + tag_offset;
> >>> }
> >>
> >> The code that has been added in blk_mq_hctx_notify_offline() will only
> >> work correctly if blk_mq_get_tag() tests BLK_MQ_S_INACTIVE after the
> >> store instructions involved in the tag allocation happened. Does this
> >> mean that a memory barrier should be added in the above function before
> >> the test_bit() call?
> >
> > Please see comment in blk_mq_hctx_notify_offline():
> >
> > + /*
> > + * Prevent new request from being allocated on the current hctx.
> > + *
> > + * The smp_mb__after_atomic() pairs with the implied barrier in
> > + * test_and_set_bit_lock in sbitmap_get(). Ensures the inactive flag is
> > + * seen once we return from the tag allocator.
> > + */
> > + set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
>
> From Documentation/atomic_bitops.txt: "Except for a successful
> test_and_set_bit_lock() which has ACQUIRE semantics and
> clear_bit_unlock() which has RELEASE semantics."

test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state) is called only after a
tag has been allocated, which means that test_and_set_bit_lock() has
already succeeded before the test_bit(). The ACQUIRE semantics guarantee
that test_bit(BLK_MQ_S_INACTIVE) is always executed after the successful
test_and_set_bit_lock(), so the tag bit is always set before
BLK_MQ_S_INACTIVE is tested.

See Documentation/memory-barriers.txt:
(5) ACQUIRE operations.

This acts as a one-way permeable barrier. It guarantees that all memory
operations after the ACQUIRE operation will appear to happen after the
ACQUIRE operation with respect to the other components of the system.
ACQUIRE operations include LOCK operations and both smp_load_acquire()
and smp_cond_load_acquire() operations.
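
To put the two sides next to each other, the ordering is roughly the
following (a simplified sketch using the identifiers from the patch,
with arguments trimmed and the sbitmap internals elided):

    CPU0 (blk_mq_hctx_notify_offline)    CPU1 (blk_mq_get_tag)
    ---------------------------------    ---------------------
    set_bit(BLK_MQ_S_INACTIVE);          test_and_set_bit_lock(tag); /* ACQUIRE */
    smp_mb__after_atomic();              test_bit(BLK_MQ_S_INACTIVE);
    read tag bits (drain);

CPU1's test_bit() cannot be reordered before the successful
test_and_set_bit_lock(), and CPU0's reading of the tag bits cannot be
reordered before the set_bit() because of the smp_mb__after_atomic().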

>
> My understanding is that operations that have acquire semantics pair
> with operations that have release semantics. I haven't been able to find
> any documentation that shows that smp_mb__after_atomic() has release
> semantics. So I looked up its definition. This is what I found:
>
> $ git grep -nH 'define __smp_mb__after_atomic'
> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic() barrier()
> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic() smp_llsc_mb()
> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic() barrier()
> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic() barrier()
> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic() do { } while (0)
> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic() barrier()
> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic() __smp_mb()
>
> My interpretation of the above is that not all smp_mb__after_atomic()
> implementations have release semantics. Do you agree with this conclusion?

My understanding is that smp_mb__after_atomic() orders the
set_bit(BLK_MQ_S_INACTIVE) against the subsequent reading of the tag
bits, which is done in blk_mq_all_tag_iter().

So the two pairs of operations are ordered, and one of the following two
cases always holds (a userspace sketch of the pattern follows below):

1) if a request (tag bit) is allocated before BLK_MQ_S_INACTIVE is set,
the tag bit will be observed by blk_mq_all_tag_iter(), called from
blk_mq_hctx_has_requests(), so the request will be drained.

OR

2) if a request (tag bit) is allocated after BLK_MQ_S_INACTIVE is set,
the request (tag bit) will be released and eventually retried on another
CPU; see __blk_mq_alloc_request().
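
For anyone who wants to experiment with the pattern outside the kernel,
below is a minimal userspace analogue in C11 atomics. The mapping is my
own illustration, not the kernel primitives themselves:
atomic_exchange_explicit(..., memory_order_acquire) stands in for
test_and_set_bit_lock(), and atomic_thread_fence(memory_order_seq_cst)
for smp_mb__after_atomic(). A single run only shows one interleaving; a
real analysis would use a litmus-test tool such as herd7.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/*
 * Illustrative mapping (an assumption for demonstration, not kernel code):
 *   tag_bit  ~ one sbitmap tag bit
 *   inactive ~ BLK_MQ_S_INACTIVE in hctx->state
 */
static atomic_int tag_bit;
static atomic_int inactive;

static int saw_inactive;	/* what the allocation side observed */
static int saw_tag;		/* what the offline side observed */

/* blk_mq_get_tag() side */
static void *alloc_side(void *arg)
{
	/* test_and_set_bit_lock(): RMW with ACQUIRE semantics on success */
	atomic_exchange_explicit(&tag_bit, 1, memory_order_acquire);
	/* test_bit(BLK_MQ_S_INACTIVE): plain read after the ACQUIRE */
	saw_inactive = atomic_load_explicit(&inactive, memory_order_relaxed);
	return NULL;
}

/* blk_mq_hctx_notify_offline() side */
static void *offline_side(void *arg)
{
	/* set_bit(BLK_MQ_S_INACTIVE) */
	atomic_store_explicit(&inactive, 1, memory_order_relaxed);
	/* smp_mb__after_atomic(): full barrier */
	atomic_thread_fence(memory_order_seq_cst);
	/* blk_mq_all_tag_iter(): read the tag bit */
	saw_tag = atomic_load_explicit(&tag_bit, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t a, o;

	pthread_create(&a, NULL, alloc_side, NULL);
	pthread_create(&o, NULL, offline_side, NULL);
	pthread_join(a, NULL);
	pthread_join(o, NULL);

	/*
	 * Case 1) above: saw_tag == 1, the request is seen and drained.
	 * Case 2) above: saw_inactive == 1, the tag is put back and retried.
	 */
	printf("saw_tag=%d saw_inactive=%d\n", saw_tag, saw_inactive);
	return 0;
}

Build with e.g. "gcc -O2 -pthread" and run it a few times: outcomes
matching cases 1) and 2) above can both show up, while the combination
saw_tag == 0 && saw_inactive == 0 is the one the argument above treats
as forbidden.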

Cc Paul and linux-kernel list.


Thanks,
Ming