Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

From: Balbir Singh
Date: Mon Oct 10 2016 - 07:18:35 EST




On 10/10/16 21:22, Michael Ellerman wrote:
> Hi Tejun,
>
> Tejun Heo <tj@xxxxxxxxxx> writes:
>> From f85002f627f7fdc7b3cda526863f5c9a8d36b997 Mon Sep 17 00:00:00 2001
>> From: Tejun Heo <tj@xxxxxxxxxx>
>> Date: Fri, 16 Sep 2016 15:49:32 -0400
>> Subject: [PATCH] workqueue: make workqueue available early during boot
>>
>> Workqueue is currently initialized in an early init call; however,
>> there are cases where early boot code has to be split and reordered to
>> come after workqueue initialization or the same code path which makes
>> use of workqueues is used both before workqueue initailization and
>> after. The latter cases have to gate workqueue usages with
>> keventd_up() tests, which is nasty and easy to get wrong.
>>
>> Workqueue usages have become widespread and it'd be a lot more
>> convenient if it can be used very early from boot. This patch splits
>> workqueue initialization into two steps. workqueue_init_early() which
>> sets up the basic data structures so that workqueues can be created
>> and work items queued, and workqueue_init() which actually brings up
>> workqueues online and starts executing queued work items. The former
>> step can be done very early during boot once memory allocation,
>> cpumasks and idr are initialized. The latter right after kthreads
>> become available.
>>
>> This allows work item queueing and canceling from very early boot
>> which is what most of these use cases want.
>>
>> * As systemd_wq being initialized doesn't indicate that workqueue is
>> fully online anymore, update keventd_up() to test wq_online instead.
>> The follow-up patches will get rid of all its usages and the
>> function itself.
>>
>> * Flushing doesn't make sense before workqueue is fully initialized.
>> The flush functions trigger WARN and return immediately before fully
>> online.
>>
>> * Work items are never in-flight before fully online. Canceling can
>> always succeed by skipping the flush step.
>>
>> * Some code paths can no longer assume to be called with irq enabled
>> as irq is disabled during early boot. Use irqsave/restore
>> operations instead.
>>
>> v2: Watchdog init, which requires timer to be running, moved from
>> workqueue_init_early() to workqueue_init().
>>
>> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
>> Suggested-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>> Link: http://lkml.kernel.org/r/CA+55aFx0vPuMuxn00rBSM192n-Du5uxy+4AvKa0SBSOVJeuCGg@xxxxxxxxxxxxxx
>
>
> This patch seems to be causing one of my Power8 boxes not to boot.
>
> Specifically commit 3347fa092821 ("workqueue: make workqueue available
> early during boot") in linux-next.
>
> If I revert this on top of next-20161005 then the machine boots again.
>
> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>
> cheers
>
>
> bootconsole [udbg0] disabled
> bootconsole [udbg0] disabled
> mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
> pid_max: default: 163840 minimum: 1280
> Dentry cache hash table entries: 16777216 (order: 11, 134217728 bytes)
> Inode-cache hash table entries: 8388608 (order: 10, 67108864 bytes)
> Mount-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Mountpoint-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Unable to handle kernel paging request for data at address 0x00000038
> Faulting instruction address: 0xc0000000000fc0cc
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
> task: c0000007f5400000 task.stack: c000001ffc084000
> NIP: c0000000000fc0cc LR: c0000000000ed928 CTR: c0000000000fbfd0
> REGS: c000001ffc087780 TRAP: 0300 Not tainted (4.8.0-compiler_gcc-6.2.0-next-20161005)
> MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 48000424 XER: 00000000
> CFAR: c0000000000089dc DAR: 0000000000000038 DSISR: 40000000 SOFTE: 0
> GPR00: c0000000000ed928 c000001ffc087a00 c000000000e63200 c000000010d6d600
> GPR04: c0000007f5409200 0000000000000021 000000000748e08c 000000000000001f
> GPR08: 0000000000000000 0000000000000021 000000000748f1f8 0000000000000000
> GPR12: 0000000028000422 c00000000fb80000 c00000000000e0c8 0000000000000000
> GPR16: 0000000000000000 0000000000000000 0000000000000021 0000000000000001
> GPR20: ffffffffafb50401 0000000000000000 c000000010d6d600 000000000000ba7e
> GPR24: 000000000000ba7e c000000000d8bc58 afb504000afb5041 0000000000000001
> GPR28: 0000000000000000 0000000000000004 c0000007f5409280 0000000000000000
> NIP [c0000000000fc0cc] enqueue_task_fair+0xfc/0x18b0
> LR [c0000000000ed928] activate_task+0x78/0xe0
> Call Trace:
> [c000001ffc087a00] [c0000007f5409200] 0xc0000007f5409200 (unreliable)
> [c000001ffc087b10] [c0000000000ed928] activate_task+0x78/0xe0
> [c000001ffc087b50] [c0000000000ede58] ttwu_do_activate+0x68/0xc0
> [c000001ffc087b90] [c0000000000ef1b8] try_to_wake_up+0x208/0x4f0
> [c000001ffc087c10] [c0000000000d3484] create_worker+0x144/0x250
> [c000001ffc087cb0] [c000000000cd72d0] workqueue_init+0x124/0x150
> [c000001ffc087d00] [c000000000cc0e74] kernel_init_freeable+0x158/0x360
> [c000001ffc087dc0] [c00000000000e0e4] kernel_init+0x24/0x160
> [c000001ffc087e30] [c00000000000bfa0] ret_from_kernel_thread+0x5c/0xbc
> Instruction dump:
> 62940401 3b800000 3aa00000 7f17c378 3a600001 3b600001 60000000 60000000
> 60420000 72490021 ebfe0150 2f890001 <ebbf0038> 419e0de0 7fbee840 419e0e58
> ---[ end trace 0000000000000000 ]---
>
>
> c0000000000fbfd0 <enqueue_task_fair>:
> r4 = p
> ...
> c0000000000fc040: 80 00 c4 3b addi r30,r4,128 r30 = r4 + 128 (&p->se)
> ...
> c0000000000fc0c4: 50 01 fe eb ld r31,336(r30) r31 = *(r30 + 336) = se->cfs_rq
> c0000000000fc0c8: 01 00 89 2f cmpwi cr7,r9,1
> c0000000000fc0cc: 38 00 bf eb ld r29,56(r31) r29 = cfs_rq->curr
>

I think there is a race

rest_init()
{
...
kernel_thread(kernel_init, NULL, CLONE_FS);
numa_default_policy();
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
...

}

create_worker() needs kthreadd, it wakes up kthreadd in kthread_create_on_node,
workqueue_init() is called from kernel_init() , but kthreadd is created after
the call to kernel_init(), so its touch and go

Balbir Singh.