Re: [PATCH 1/4] workqueue: Reap workers via kthread_stop() and remove detach_completion

From: Lai Jiangshan
Date: Tue Sep 10 2024 - 23:32:40 EST


On Wed, Sep 11, 2024 at 11:23 AM Lai Jiangshan <jiangshanlai@xxxxxxxxx> wrote:
>
> Hello, Marc
>
> On Wed, Sep 11, 2024 at 12:29 AM Marc Hartmayer <mhartmay@xxxxxxxxxxxxx> wrote:
> > Code starting with the faulting instruction
> > ===========================================
> > 000002d8c205ef20: a7180000 lhi %r1,0
> > #000002d8c205ef24: 582083ac l %r2,940(%r8)
> > >000002d8c205ef28: ba12a000 cs %r1,%r2,0(%r10)
> > 000002d8c205ef2c: a77400cf brc 7,000002d8c205f0ca
> > 000002d8c205ef30: 5800b078 l %r0,120(%r11)
> > 000002d8c205ef34: a7010002 tmll %r0,2
> > 000002d8c205ef38: a77400d4 brc 7,000002d8c205f0e0
> > [ 14.271766] Call Trace:
> > [ 14.271769] worker_thread (./arch/s390/include/asm/atomic_ops.h:198 ./arch/s390/include/asm/spinlock.h:61 ./arch/s390/include/asm/spinlock.h:66 ./include/linux/spinlock.h:187 ./include/linux/spinlock_api_smp.h:120 kernel/workqueue.c:3346)
> > [ 14.271774] worker_thread (./arch/s390/include/asm/lowcore.h:226 ./arch/s390/include/asm/spinlock.h:61 ./arch/s390/include/asm/spinlock.h:66 ./include/linux/spinlock.h:187 ./include/linux/spinlock_api_smp.h:120 kernel/workqueue.c:3346)
> > [ 14.271777] kthread (kernel/kthread.c:389)
> > [ 14.271781] __ret_from_fork (arch/s390/kernel/process.c:62)
> > [ 14.271784] ret_from_fork (arch/s390/kernel/entry.S:309)
> > [ 14.271806] Last Breaking-Event-Address:
> > [ 14.271807] mutex_unlock (kernel/locking/mutex.c:549)
> >
> > So it seems to me that `worker->pool` is NULL in the
> > `workqueue.c:worker_thread` function and this leads to the crash.
> >
>
> I'm not familiar with s390 asm code, but it might be the case that
> `worker->pool` is NULL in worker_thread() since detach_worker()
> resets worker->pool to NULL.
>
> If it is the case, READ_ONCE(worker->pool) should be used in worker_thread()
> to fix the problem.
>
> (It would be weird to me if worker->pool were read multiple times in
> worker_thread(), even though it is used many times, but since READ_ONCE()
> is not used, it is possible).

Oh, it is possible that the worker is created and then destroyed before
ever being woken up. If that is the case, READ_ONCE() won't help, because
the very first read of worker->pool would already return NULL. I'm going to
explore whether "worker->pool = NULL;" can be moved out of detach_worker().
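
To make the ordering above concrete, here is a small userspace model. It is
only a sketch, not kernel code: the structs are simplified stand-ins,
READ_ONCE_PTR() is a rough analogue of READ_ONCE(), and the model_*() helpers
are hypothetical names that replay "create, detach, first run" sequentially.
It assumes worker_thread() loads worker->pool once on entry, before taking
pool->lock.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (not the real layout). */
struct worker_pool { int lock; };
struct worker { struct worker_pool *pool; };

/* Single volatile load: a rough userspace analogue of READ_ONCE(). */
#define READ_ONCE_PTR(p) (*(struct worker_pool * volatile *)&(p))

/* Models only the part of detach_worker() that clears worker->pool. */
static void model_detach_worker(struct worker *w)
{
	w->pool = NULL;
}

/*
 * Models the entry of worker_thread(): load the pool pointer exactly once.
 * The real function would go on to lock pool->lock through this pointer.
 */
static struct worker_pool *model_worker_thread_entry(struct worker *w)
{
	return READ_ONCE_PTR(w->pool);
}

/*
 * Replays the suspected ordering: the worker is created, detached before
 * its first wake-up, and only then runs. Returns 1 if the single load
 * still observes NULL, 0 otherwise.
 */
static int model_destroyed_before_wakeup(void)
{
	struct worker_pool pool = { 0 };
	struct worker w = { .pool = &pool };	/* created, not yet run */

	model_detach_worker(&w);		/* destroyed before wake-up */
	return model_worker_thread_entry(&w) == NULL;
}
```

Calling model_destroyed_before_wakeup() returns 1: the load happens once, but
it happens after the pointer was cleared, so a single-load guarantee alone
does not avoid the NULL dereference in this ordering.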

Thanks
Lai