Re: __queue_work oops.

From: Tejun Heo
Date: Mon Feb 27 2017 - 13:40:16 EST


Hello,

On Mon, Feb 27, 2017 at 12:14:39PM -0500, Dave Jones wrote:
> Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.10.0-think+ #9
> task: ffff88017f105440 task.stack: ffffc90000094000
> RIP: 0010:__queue_work+0x2d/0x700
> RSP: 0018:ffff880507c03df8 EFLAGS: 00010046
> RAX: 0000000000000082 RBX: 0000000000000101 RCX: 0000000000000002
> RDX: ffff88047bf07c98 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffff880507c03e30 R08: 0000000000000001 R09: ffffffff8294bf68
> R10: ffff880507c03e58 R11: 0000000000000000 R12: ffff88047bf07ce8
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047bf07c98
> FS: 0000000000000000(0000) GS:ffff880507c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000001c2 CR3: 0000000004e11000 CR4: 00000000001406e0
> Call Trace:
> <IRQ>
> ? work_on_cpu+0xb0/0xb0
> delayed_work_timer_fn+0x1e/0x20
> call_timer_fn+0xbd/0x480
...
> Code starting with the faulting instruction
> ===========================================
> 0: 41 f6 85 c2 01 00 00 testb $0x1,0x1c2(%r13)
> 7: 01
> 8: 0f 85 22 04 00 00 jne 0x430
> e: 49 rex.WB
> f: bc eb 83 b5 80 mov $0x80b583eb,%esp
> 14: 46 rex.RX
>
> 0000000000003cf0 <__queue_work>:
> {
> 3cf0: e8 00 00 00 00 callq 3cf5 <__queue_work+0x5>
> 3cf5: 55 push %rbp
> 3cf6: 48 89 e5 mov %rsp,%rbp
> 3cf9: 41 57 push %r15
> 3cfb: 49 89 d7 mov %rdx,%r15
> 3cfe: 41 56 push %r14
> unsigned int req_cpu = cpu;
> 3d00: 41 89 fe mov %edi,%r14d
> {
> 3d03: 41 55 push %r13
> 3d05: 49 89 f5 mov %rsi,%r13
> 3d08: 41 54 push %r12
> 3d0a: 53 push %rbx
> 3d0b: 48 83 ec 10 sub $0x10,%rsp
> 3d0f: 89 7d d4 mov %edi,-0x2c(%rbp)
> asm volatile("# __raw_save_flags\n\t"
> 3d12: 9c pushfq
> 3d13: 58 pop %rax
> WARN_ON_ONCE(!irqs_disabled());
> 3d14: f6 c4 02 test $0x2,%ah
> 3d17: 0f 85 06 04 00 00 jne 4123 <__queue_work+0x433>
> if (unlikely(wq->flags & __WQ_DRAINING) &&
> 3d1d: 41 f6 85 c2 01 00 00 testb $0x1,0x1c2(%r13)
>
>
> So we called __queue_work with a null wq.

So, that's somebody calling queue_delayed_work[_on]() with a NULL wq
and when the timeout expires the timer callback trying to queue
against NULL. Hmm... the work function would be able to tell us who
queued it but it isn't part of the information dumped here (would be
0x18(%rdx)).

I'll add a sanity check on queue_delayed_work_on() so that we can
catch it synchronously when it happens.

Thanks.

--
tejun