Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO

From: Roman Penyaev
Date: Mon Apr 25 2016 - 12:35:11 EST


Hello, Tejun,

On Mon, Apr 25, 2016 at 5:48 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, Roman.
>
> On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
> ...
>> CPU#6                                 CPU#2
>> request ffff884000343600 inserted
>> hctx marked as pended
>> kblockd_schedule...() returns 1
>> <schedule to kblockd workqueue>
>>                                       *** WORK_STRUCT_PENDING_BIT is cleared ***
>>                                       flush_busy_ctxs() is executed
>> request ffff884000343cc0 inserted
>> hctx marked as pended
>> kblockd_schedule...() returns 0
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> WTF?
>>
>> As a result, request ffff884000343cc0 stays pending forever.
>>
>> According to the trace output I see that another CPU _always_ observes
>> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even though it
>> was cleared on another CPU.
>>
>> Checking the workqueue.c code I see that clearing the bit is nothing
>> more than an atomic_long_set(), which is a <mov> instruction, done in
>> this function:
>>
>> static inline void set_work_data()
>>
>> In an attempt to "fix" the mystery I replaced the atomic_long_set() call
>> with atomic_long_xchg(), which is an <xchg> instruction.
>>
>> The problem went away.
>>
>> To me it looks as if test_and_set_bit() (a <lock btsl> instruction) does
>> not force a flush of the CPU caches, which can be dirty after executing
>> a <mov> on another CPU. But <xchg> really updates the memory, and the
>> following execution of <lock btsl> observes that the bit was cleared.
>>
>> In conclusion I can say that I am lucky enough to be able to reproduce
>> this bug within several minutes on a specific load (I tried many other
>> simple loads using fio, even btrecord/btreplay, without success).
>> That easy reproduction on a specific load gives me some freedom to test
>> and then to be sure that the problem has really gone.
>
> Heh, excellent debugging. I wonder how old this bug is. cc'ing David
> Howells who ISTR to have reported a similar issue. The root problem
> here, I think, is that PENDING is used to synchronize between
> different queueing instances but we don't have a proper memory barrier
> after it.
>
> A                                   B
>
> queue (test_and_sets PENDING)
> dispatch (clears PENDING)
> execute                             queue (test_and_sets PENDING)
>
> So, for B, the guarantee must be that either A starts executing after
> B's test_and_set or B's test_and_set succeeds; however, as we don't
> have any memory barrier between dispatch and execute, there's nothing
> preventing the processor from scheduling some memory fetch operations
> from the execute stage before the clearing of PENDING - i.e. A might
> not see what B has done prior to queue even if B's test_and_set fails,
> indicating that A should. Can you please test whether the following
> patch fixes the issue?
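
Just to make the two sides of your A/B diagram above concrete for anyone
following the thread, this is how I read them in workqueue.c (paraphrased
and heavily trimmed, so the snippets may not match the source exactly):

        /* B: queueing side, paraphrased from queue_work_on().  PENDING is
         * claimed with test_and_set_bit(), i.e. <lock btsl>, which is a
         * full barrier by itself. */
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
                __queue_work(cpu, wq, work);    /* publish the new request */
                ret = true;
        }

        /* A: execution side, paraphrased from set_work_pool_and_clear_pending().
         * PENDING is dropped with a plain atomic_long_set() inside
         * set_work_data(), i.e. a simple <mov>, and nothing orders that store
         * against the loads the work item performs afterwards. */
        smp_wmb();
        set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
        /* ... the worker then runs hctx->run_work and reads the queued state ... */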

I can assure you that smp_mb() helps (at least when running for 30 minutes
under IO). That was my first variant, but I did not like it because I
could not explain to myself two things:

1. Why not smp_wmb()? We need to flush after the update.
(I tried that as well, and it does not help.)

2. What protects us from this situation?

CPU#0                                CPU#1
set_work_data()
                                     test_and_set_bit()
smp_mb()

And the second question was crucial to me, because even a tiny delay
"fixes" the problem, e.g. an ndelay() also "fixes" the bug:

         smp_wmb();
         set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+        ndelay(40);
 }

Why ndelay(40)? Because on this machine smp_mb() takes 40 ns on average.
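
The way I currently picture the ordering requirement is the classic
store-buffering pattern below. This is only a userspace analogy in C11
atomics, not the real workqueue code; the names pending and hctx_state are
made up for illustration. Each side must make its store globally visible
before its following load, which is what a full barrier after the store
(or an <xchg>) gives and what an smp_wmb() before it does not:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_long pending = 1;    /* stands in for WORK_STRUCT_PENDING_BIT,
                                    * assumed set by a previous queueing    */
static atomic_long hctx_state;     /* stands in for the state that
                                    * flush_busy_ctxs() consumes            */

/* Executor side: roughly what set_work_pool_and_clear_pending() plus the
 * work function do.  The relaxed store models the plain atomic_long_set()
 * (<mov>); without the full fence, the load below may effectively be
 * performed before the store becomes visible to the other CPU.            */
static long executor(void)
{
        atomic_store_explicit(&pending, 0, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* the smp_mb() */
        return atomic_load_explicit(&hctx_state, memory_order_relaxed);
}

/* Queuer side: roughly what the caller of kblockd_schedule...() does.
 * In the kernel test_and_set_bit() (<lock btsl>) already implies a full
 * barrier; the explicit fence models that here.                           */
static bool queuer(void)
{
        atomic_store_explicit(&hctx_state, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        return atomic_exchange_explicit(&pending, 1,
                                        memory_order_seq_cst) == 0;
}

/*
 * With both fences in place the outcome "executor() returns 0 (misses the
 * new request) AND queuer() returns false (sees PENDING still set, so it
 * does not re-queue)" is forbidden.  Drop the executor's fence and both
 * can happen together, which is exactly the ghost PENDING hang.
 */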

--
Roman

>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2232ae3..8ec2b5e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>          */
>         smp_wmb();
>         set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +       smp_mb();
>  }
>
>  static void clear_work_data(struct work_struct *work)
>
>
> --
> tejun