re. Spurious wakeup on a newly created kthread

From: Tejun Heo
Date: Sat Jun 25 2022 - 01:00:56 EST


Hello,

cc'ing random assortment of ppl who touched kernel/kthread.c and
others who would know better.

So, Petr debugged a NULL deref in workqueue code to a spurious wakeup
on a newly created kthread. The abbreviated patch description follows.
The original message is at

http://lkml.kernel.org/r/20220622140853.31383-1-pmladek@xxxxxxxx

On Wed, Jun 22, 2022 at 04:08:53PM +0200, Petr Mladek wrote:
> A system crashed with the following BUG() report:
>
> [115147.050484] BUG: kernel NULL pointer dereference, address: 0000000000000000
...
> [115147.050524] Call Trace:
> [115147.050533] worker_thread+0xb4/0x3c0
> [115147.050540] kthread+0x152/0x170
> [115147.050544] ret_from_fork+0x35/0x40
>
> Further debugging shown that the worker thread was woken
> before worker_attach_to_pool() finished in create_worker().
>
> Any kthread is supposed to stay in TASK_UNINTERRUPTIBLE sleep
> until it is explicitly woken. But a spurious wakeup might
> break this expectation.
>
> As a result, worker_thread() might read worker->pool before
> it was set in worker create_worker() by worker_attach_to_pool().
> Also manage_workers() might want to create yet another worker
> before worker->pool->nr_workers is updated. It is a kind off
> a chicken & egg problem.

tl;dr is that the worker creation code expects a newly created worker
kthread to sit tight until the creator finishes setting up stuff and
sends the initial wakeup. However, something, which wasn't identified
in the report (Petr, it'd be great if you can find out who did the
wakeup), wakes up the new kthread before the creation path is done
with init which causes the new kthread to try to deref a NULL pointer.

Petr fixed the problem by adding an extra handshake step so that the
new kthread explicitly waits for the creation path, which is fine, but
the picture isn't making sense to me.

* Are spurious wakeups allowed? The way that we do set_current_state()
in every iteration in wait_event() seems to suggest that we expect
someone to spuriously flip task state to RUNNING.

* However, if we're to expect spurious wakeups for anybody anytime,
why does a newly created kthread bother with
schedule_preempt_disabled() in kernel/kthread.c::kthread() at all?
It can't guarantee anything and all it does is masking subtle bugs.

What am I missing here?

Thanks.

--
tejun