[Workqueue] crash in process_one_work

From: Arun KS
Date: Mon Sep 29 2014 - 12:10:57 EST


Hello Tejun/Lai,

I am seeing the following crash in the 3.10.49 kernel.

[ 1133.893817] Unable to handle kernel NULL pointer dereference at virtual address 00000004
[ 1133.893821] pgd = c0004000
[ 1133.893827] [00000004] *pgd=00000000
[ 1133.893834] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 1133.893841] Modules linked in:
[ 1133.893849] CPU: 2 PID: 5359 Comm: kworker/u8:20 Not tainted 3.10.28-g99b6153-00006-gc32dab7 #1
[ 1133.893859] task: d8c2aa00 ti: e79a4000 task.ti: e79a4000
[ 1133.893873] PC is at process_one_work+0x18/0x448
[ 1133.893878] LR is at process_one_work+0x14/0x448
[ 1133.893887] pc : [<c0135218>] lr : [<c0135214>] psr: 400f0093
sp : e79a5ef8 ip : daf7f100 fp : 00000089
[ 1133.893891] r10: daf7f118 r9 : ee80e820 r8 : ee80e800
[ 1133.893897] r7 : c111872e r6 : ee80e800 r5 : ed7cf150 r4 : daf7f100
[ 1133.893902] r3 : ffffffe0 r2 : 00000081 r1 : ed7cf150 r0 : 00000000
[ 1133.893908] Flags: nZcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment kernel
[ 1133.893914] Control: 10c5383d Table: a7dbc06a DAC: 00000015

Here is the part of process_one_work where the crash happens:

struct pool_workqueue *pwq = get_work_pwq(work);
struct worker_pool *pool = worker->pool;
bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;

get_work_pwq returned NULL because the WORK_STRUCT_PWQ flag was not set
in work_struct->data, and the crash happened while dereferencing that
NULL pointer. There is no NULL check here, which implies that this
condition is never expected to occur.
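
As an illustration, the lookup that yields the NULL can be modelled in
user space. This is only a sketch based on a reading of 3.10, not the
kernel source: the MODEL_* names and struct model_pwq are hypothetical,
the bit positions assume CONFIG_DEBUG_OBJECTS_WORK is disabled, the
pool/wq field order at the start of struct pool_workqueue is assumed,
and 32-bit integers stand in for ARM pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, mirroring 3.10 include/linux/workqueue.h on 32-bit
 * with CONFIG_DEBUG_OBJECTS_WORK disabled. */
#define MODEL_WORK_STRUCT_PWQ       (1u << 2)  /* WORK_STRUCT_PWQ_BIT = 2 */
#define MODEL_WORK_STRUCT_FLAG_BITS 8u         /* color shift 4 + color bits 4 */
#define MODEL_WQ_DATA_MASK          (~((1u << MODEL_WORK_STRUCT_FLAG_BITS) - 1u))

/* Hypothetical stand-in for the start of struct pool_workqueue, with
 * 32-bit fields in place of ARM pointers (field order is assumed). */
struct model_pwq {
	uint32_t pool;  /* struct worker_pool * */
	uint32_t wq;    /* struct workqueue_struct * */
};

/* Model of get_work_pwq(): returns the pwq address, or 0 for NULL. */
static uint32_t model_get_work_pwq(uint32_t data)
{
	if (data & MODEL_WORK_STRUCT_PWQ)
		return data & MODEL_WQ_DATA_MASK;
	return 0;
}

/* Address that a load of pwq->wq touches for a given pwq address. */
static uint32_t model_wq_load_addr(uint32_t pwq)
{
	return pwq + (uint32_t)offsetof(struct model_pwq, wq);
}

/* Check the model against the values in the oops above. */
void model_check_oops(void)
{
	uint32_t pwq = model_get_work_pwq(UINT32_C(0xffffffe0)); /* r3 */

	assert(pwq == 0);                      /* PWQ bit clear, so NULL (r0) */
	assert(model_wq_load_addr(pwq) == 4);  /* faulting address 00000004 */
}
```

Under these assumptions the faulting address 00000004 is consistent
with reading pwq->wq through a NULL pwq.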

The corresponding work_struct looks like this:

crash> struct work_struct ed7cf150
struct work_struct {
data = {
counter = 0xffffffe0
},
entry = {
next = 0xed7cf154,
prev = 0xed7cf154
},
func = 0xc0140ac4 <async_run_entry_fn>
}

The value of data is 0xffffffe0, which is the value left by
INIT_WORK() or WORK_DATA_INIT().
This can happen if a driver calls INIT_WORK again on the same
work_struct after queuing it.
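
The arithmetic behind 0xffffffe0 can be sketched as follows. The
constants are assumptions taken from a reading of 3.10
include/linux/workqueue.h for BITS_PER_LONG == 32 with
CONFIG_DEBUG_OBJECTS_WORK disabled, and the MODEL_* names are
hypothetical stand-ins for the kernel's enum values.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 3.10 layout of work_struct->data on a 32-bit kernel without
 * CONFIG_DEBUG_OBJECTS_WORK. */
enum {
	MODEL_BITS_PER_LONG   = 32,
	MODEL_COLOR_SHIFT     = 4,                  /* WORK_STRUCT_COLOR_SHIFT */
	MODEL_OFFQ_FLAG_BASE  = MODEL_COLOR_SHIFT,  /* WORK_OFFQ_FLAG_BASE */
	MODEL_OFFQ_FLAG_BITS  = 1,                  /* WORK_OFFQ_CANCELING only */
	MODEL_OFFQ_POOL_SHIFT = MODEL_OFFQ_FLAG_BASE + MODEL_OFFQ_FLAG_BITS,
	MODEL_OFFQ_POOL_BITS  = MODEL_BITS_PER_LONG - MODEL_OFFQ_POOL_SHIFT
};

/* Largest pool id, used as the "no pool" marker (WORK_OFFQ_POOL_NONE). */
static uint32_t model_offq_pool_none(void)
{
	return (UINT32_C(1) << MODEL_OFFQ_POOL_BITS) - 1u;
}

/* Value stored by WORK_DATA_INIT(): the "no pool" id shifted into place. */
static uint32_t model_work_struct_no_pool(void)
{
	return model_offq_pool_none() << MODEL_OFFQ_POOL_SHIFT;
}

/* Pool id encoded in an off-queue data word. */
static uint32_t model_offq_pool_id(uint32_t data)
{
	return data >> MODEL_OFFQ_POOL_SHIFT;
}

/* Check the model against the data word seen in the dump. */
void model_check_init_value(void)
{
	assert(model_work_struct_no_pool() == UINT32_C(0xffffffe0));
	assert(model_offq_pool_id(UINT32_C(0xffffffe0)) == model_offq_pool_none());
}
```

With these values the data word carries no flag bits and the "no pool"
id, which matches a freshly initialised work_struct.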

The work_struct details above show that the work was queued from
kernel/async.c. async_schedule dynamically allocates the work_struct
and queues it to system_unbound_wq, so there is no possibility of
INIT_WORK being called again on the same work.

Inspecting the ramdump for the enclosing async_entry structure from kernel/async.c:

crash> struct async_entry ed7cf140
struct async_entry {
domain_list = {
next = 0xed7cf140,
prev = 0xed7cf140
},
global_list = {
next = 0xed7cf148,
prev = 0xed7cf148
},
work = {
data = {
counter = 0xffffffe0
},
entry = {
next = 0xed7cf154,
prev = 0xed7cf154
},
func = 0xc0140ac4 <async_run_entry_fn>
},
cookie = 0x263e5,
func = 0xc074dda0 <dapm_post_sequence_async>,
data = 0xed48432c,
domain = 0xe5457dec
}

The func points to dapm_post_sequence_async, and you can see that
domain_list and global_list are both empty. This shows that the work
has finished execution and that nothing is pending in async.
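
The "empty" reading of those lists can be checked mechanically: a
list_head is empty when its next pointer points back at itself. The
sketch below applies that test to the addresses from the dump. The
model_* names are hypothetical, 32-bit integers stand in for ARM
pointers, and the field offsets are inferred from the addresses shown
in the dump rather than taken from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit model of struct list_head as seen in the ARM ramdump. */
struct model_list_head {
	uint32_t next;
	uint32_t prev;
};

/* Same test as the kernel's list_empty(): the head's next pointer
 * points back at the head itself. */
static int model_list_empty(uint32_t head_addr, struct model_list_head head)
{
	return head.next == head_addr;
}

/* Apply the test to the values from the async_entry dump. */
void model_check_async_entry(void)
{
	const uint32_t entry = UINT32_C(0xed7cf140);  /* async_entry base */
	struct model_list_head domain_list = { 0xed7cf140u, 0xed7cf140u };
	struct model_list_head global_list = { 0xed7cf148u, 0xed7cf148u };
	struct model_list_head work_entry  = { 0xed7cf154u, 0xed7cf154u };

	assert(model_list_empty(entry + 0x00, domain_list));
	assert(model_list_empty(entry + 0x08, global_list));
	/* work starts at +0x10; work.entry follows the 4-byte data word. */
	assert(model_list_empty(entry + 0x14, work_entry));
}
```

By the same test, work.entry in the dump is also self-linked.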

But how did this work_struct still end up in the workqueue data
structures? Is there any corner case in the workqueue code that can
miss unlinking the work_struct from the pool_workqueue after
executing it?

I would really appreciate your inputs/pointers.
Please let me know if you want any more information from the crashed system.

Thanks,
Arun