Re: [PATCH wq/for-4.0-fixes v2] workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE

From: Tomeu Vizoso
Date: Thu Mar 05 2015 - 04:25:23 EST


On 3 March 2015 at 14:57, Tomeu Vizoso <tomeu.vizoso@xxxxxxxxx> wrote:
> On 3 March 2015 at 14:45, Tejun Heo <tj@xxxxxxxxxx> wrote:
>> Hello,
>>
>> Found the culprit. Plain wake_up() shouldn't be used on
>> bit_waitqueues. The following patch should fix the issue. I replaced
>> the patch in the wq branches.
>
> Yup, this looks good from here.

Actually, I'm getting this when unloading mwifiex_sdio:

[ 317.201212] Unable to handle kernel paging request at virtual address f081a680
[ 317.208522] pgd = ebb70000
[ 317.211302] [f081a680] *pgd=ab868811, *pte=00000000, *ppte=00000000
[ 317.217692] Internal error: Oops: 7 [#1] SMP ARM
[ 317.222328] Modules linked in: cfg80211(-) bluetooth ipv6 [last unloaded: mwifiex]
[ 317.230019] CPU: 2 PID: 682 Comm: modprobe Not tainted 4.0.0-rc2-next-20150305ccu-00013-g2b3b1a8 #589
[ 317.239271] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
[ 317.245568] task: eb9cd580 ti: ebcea000 task.ti: ebcea000
[ 317.251001] PC is at bit_waitqueue+0x38/0x6c
[ 317.255312] LR is at __cancel_work_timer+0x28/0x1b0
[ 317.260215] pc : [<c028fe18>] lr : [<c0270d34>] psr: 20000113
[ 317.260215] sp : ebcebe90 ip : ee838000 fp : ebcebe9c
[ 317.271737] r10: 00000000 r9 : ebcea000 r8 : c0210ba4
[ 317.276991] r7 : 00000081 r6 : 2a07e3b0 r5 : bf134948 r4 : 00000000
[ 317.283552] r3 : 9e370001 r2 : 0000c640 r1 : e2692904 r0 : 000ff134
[ 317.290112] Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 317.297286] Control: 10c5387d Table: abb7006a DAC: 00000015
[ 317.303060] Process modprobe (pid: 682, stack limit = 0xebcea220)
[ 317.309186] Stack: (0xebcebe90 to 0xebcec000)
[ 317.313568] be80: ebcebefc ebcebea0 c0270d34 c028fdec
[ 317.321781] bea0: c05062d0 c034e57c c1210824 ebe94be0 ebcebedc eb8f080c bf134830 2a07e3b0
[ 317.329998] bec0: 00000081 c0210ba4 ebcebef4 ebcebed8 c0506428 bf139830 bf1348dc 2a07e3b0
[ 317.338220] bee0: 00000081 c0210ba4 ebcea000 00000000 ebcebf0c ebcebf00 c0270ed8 c0270d18
[ 317.346437] bf00: ebcebf34 ebcebf10 bf0ed138 c0270ec8 c11cb4f0 bf139830 bf134800 2a07e3b0
[ 317.354655] bf20: 00000081 c0210ba4 ebcebf4c ebcebf38 bf125184 bf0ed120 bf1396b8 2a07f76c
[ 317.362872] bf40: ebcebfa4 ebcebf50 c02c8b4c bf125158 ebcebf6c 38676663 31313230 00000000
[ 317.372482] bf60: ebcebf8c ebcebf70 c0273e98 c0adcdb0 ebcea010 c0210ba4 ebcebfb0 ebcea000
[ 317.382086] bf80: ebcebfac ebcebf90 c0214484 00273de0 2a07f738 2a07f738 00000000 ebcebfa8
[ 317.391706] bfa0: c0210a00 c02c89a4 2a07f738 2a07f738 2a07f76c 00000800 0fea3100 00000000
[ 317.401319] bfc0: 2a07f738 2a07f738 2a07e3b0 00000081 00000000 00000002 2a07e310 becc9db4
[ 317.410948] bfe0: 2a07d12c becc89ac 2a061623 b6eb82e6 60000030 2a07f76c 00000000 00000000
[ 317.420658] [<c028fe18>] (bit_waitqueue) from [<c0270d34>] (__cancel_work_timer+0x28/0x1b0)
[ 317.430598] [<c0270d34>] (__cancel_work_timer) from [<c0270ed8>] (cancel_work_sync+0x1c/0x20)
[ 317.440672] [<c0270ed8>] (cancel_work_sync) from [<bf0ed138>] (regulatory_exit+0x24/0x17c [cfg80211])
[ 317.451396] [<bf0ed138>] (regulatory_exit [cfg80211]) from [<bf125184>] (cfg80211_exit+0x38/0x4c [cfg80211])
[ 317.462726] [<bf125184>] (cfg80211_exit [cfg80211]) from [<c02c8b4c>] (SyS_delete_module+0x1b4/0x1f8)
[ 317.473411] [<c02c8b4c>] (SyS_delete_module) from [<c0210a00>] (ret_fast_syscall+0x0/0x34)

Regards,

Tomeu

> Thanks,
>
> Tomeu
>
>> Thanks a lot.
>>
>> ----- 8< -----
>> From bba5a377507efef69214108c0d3790cadfab21d4 Mon Sep 17 00:00:00 2001
>> From: Tejun Heo <tj@xxxxxxxxxx>
>> Date: Tue, 3 Mar 2015 08:43:09 -0500
>>
>> cancel[_delayed]_work_sync() are implemented using
>> __cancel_work_timer() which grabs the PENDING bit using
>> try_to_grab_pending() and then flushes the work item with PENDING set
>> to prevent the on-going execution of the work item from requeueing
>> itself.
>>
>> try_to_grab_pending() can always grab the PENDING bit without
>> blocking except when someone else is doing the above flushing during
>> cancelation. In that case, try_to_grab_pending() returns -ENOENT and
>> __cancel_work_timer() currently invokes flush_work(). The assumption
>> is that the completion of the work item is what the other canceling
>> task would be waiting for too, and thus waiting for the same condition
>> and retrying should allow forward progress without excessive busy
>> looping.
>>
>> Unfortunately, this doesn't work if preemption is disabled or the
>> latter task has real time priority. Let's say task A just got woken
>> up from flush_work() by the completion of the target work item. If,
>> before task A starts executing, task B gets scheduled and invokes
>> __cancel_work_timer() on the same work item, its try_to_grab_pending()
>> will return -ENOENT as the work item is still being canceled by task A
>> and flush_work() will also immediately return false as the work item
>> is no longer executing. This puts task B in a busy loop possibly
>> preventing task A from executing and clearing the canceling state on
>> the work item leading to a hang.
>>
>> task A                  task B                  worker
>>
>>                                                 executing work
>> __cancel_work_timer()
>>  try_to_grab_pending()
>>   set work CANCELING
>>  flush_work()
>>   block for work completion
>>                                                 completion, wakes up A
>>                          __cancel_work_timer()
>>                           while (forever) {
>>                            try_to_grab_pending()
>>                             -ENOENT as work is being canceled
>>                            flush_work()
>>                             false as work is no longer executing
>>                           }
>>
>> This patch removes the possible hang by updating __cancel_work_timer()
>> to explicitly wait for clearing of CANCELING rather than invoking
>> flush_work() after try_to_grab_pending() fails with -ENOENT. The
>> explicit wait uses the matching bit waitqueue for the CANCELING bit.
>>
>> Link: http://lkml.kernel.org/g/20150206171156.GA8942@xxxxxxxx
>>
>> v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
>> the target bit waitqueue has wait_bit_queue's on it. Use
>> DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu
>> Vizoso.
>>
>> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
>> Reported-by: Rabin Vincent <rabin.vincent@xxxxxxxx>
>> Cc: Tomeu Vizoso <tomeu.vizoso@xxxxxxxxx>
>> Cc: stable@xxxxxxxxxxxxxxx
>> Tested-by: Jesper Nilsson <jesper.nilsson@xxxxxxxx>
>> Tested-by: Rabin Vincent <rabin.vincent@xxxxxxxx>
>> ---
>> include/linux/workqueue.h | 3 ++-
>> kernel/workqueue.c | 37 +++++++++++++++++++++++++++++++++----
>> 2 files changed, 35 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
>> index 74db135..f597846 100644
>> --- a/include/linux/workqueue.h
>> +++ b/include/linux/workqueue.h
>> @@ -70,7 +70,8 @@ enum {
>> /* data contains off-queue information when !WORK_STRUCT_PWQ */
>> WORK_OFFQ_FLAG_BASE = WORK_STRUCT_COLOR_SHIFT,
>>
>> - WORK_OFFQ_CANCELING = (1 << WORK_OFFQ_FLAG_BASE),
>> + __WORK_OFFQ_CANCELING = WORK_OFFQ_FLAG_BASE,
>> + WORK_OFFQ_CANCELING = (1 << __WORK_OFFQ_CANCELING),
>>
>> /*
>> * When a work item is off queue, its high bits point to the last
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index f288493..cfbae1d 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -2730,17 +2730,37 @@ EXPORT_SYMBOL_GPL(flush_work);
>>
>> static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
>> {
>> + wait_queue_head_t *waitq = bit_waitqueue(&work->data,
>> + __WORK_OFFQ_CANCELING);
>> unsigned long flags;
>> int ret;
>>
>> do {
>> ret = try_to_grab_pending(work, is_dwork, &flags);
>> /*
>> - * If someone else is canceling, wait for the same event it
>> - * would be waiting for before retrying.
>> + * If someone else is already canceling, wait for it to
>> + * finish. flush_work() doesn't work for PREEMPT_NONE
>> + * because we may get scheduled between @work's completion
>> + * and the other canceling task resuming and clearing
>> + * CANCELING - flush_work() will return false immediately
>> + * as @work is no longer busy, try_to_grab_pending() will
>> + * return -ENOENT as @work is still being canceled and the
>> + * other canceling task won't be able to clear CANCELING as
>> + * we're hogging the CPU.
>> + *
>> + * Explicitly wait for completion using a bit waitqueue.
>> + * We can't use wait_on_bit() as the CANCELING bit may get
>> + * recycled to point to pwq if @work gets re-queued.
>> */
>> - if (unlikely(ret == -ENOENT))
>> - flush_work(work);
>> + if (unlikely(ret == -ENOENT)) {
>> + DEFINE_WAIT_BIT(wait, &work->data,
>> + __WORK_OFFQ_CANCELING);
>> + prepare_to_wait(waitq, &wait.wait,
>> + TASK_UNINTERRUPTIBLE);
>> + if (work_is_canceling(work))
>> + schedule();
>> + finish_wait(waitq, &wait.wait);
>> + }
>> } while (unlikely(ret < 0));
>>
>> /* tell other tasks trying to grab @work to back off */
>> @@ -2749,6 +2769,15 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
>>
>> flush_work(work);
>> clear_work_data(work);
>> +
>> + /*
>> + * Paired with prepare_to_wait() above so that either
>> + * __wake_up_bit() sees busy waitq here or !work_is_canceling() is
>> + * visible there.
>> + */
>> + smp_mb();
>> + __wake_up_bit(waitq, &work->data, __WORK_OFFQ_CANCELING);
>> +
>> return ret;
>> }
>>
>> --
>> 2.1.0
>>