Re: [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list

In reply to: Breno Leitao: "Re: [PATCH v2 1/4] workqueue: park kicked worker on pool->kicked_list"

From: Tejun Heo

Date: Fri Jun 05 2026 - 13:35:22 EST

Hello,

On Fri, Jun 05, 2026 at 07:40:43AM -0700, Breno Leitao wrote:
...
> > > task state, so a kicked-but-not-yet-scheduled worker is still a valid
> > > cull victim -- the cull can reap it before it consumes the just-enqueued
> > > work, stranding the item. The window is narrow today but later patches
> > > in this series defer the wakeup outside pool->lock, widening it.
> >
> > Have you actually reproduced this?
>
> No -- not without artificially changing the timeout in the kernel. So far this
> is a theoretical race window rather than something I've hit in
> practice; it was flagged as a critical issue by sashiko:
>
> https://sashiko.dev/#/patchset/20260526-fastwake-v1-0-e69ad86923e6%40debian.org
>
> The only way I could get it to actually strand an item was by shrinking
> IDLE_WORKER_TIMEOUT 150,000x (300s -> 2ms), so that a worker counts as
> "timed out" almost immediately.

I see. This is a bug then. It shouldn't happen even with that.

> Would it make sense to refresh last_active when a worker is kicked? The
> cull walks tail->head and breaks at the first non-expired worker, so a
> freshly-stamped kicked worker would simply be skipped while genuinely
> old workers behind it are still reaped.

I don't think timestamping is where the problem is. The intention of the
code is that the minimum idle worker count, combined with how workers
transition between states, can't lead to a situation where a work item is
pending without an idle worker to execute it, regardless of timing.
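
As a rough illustration of that intent, here is a user-space model, not
kernel code. Every name in it (struct model_worker, model_cull(),
model_invariant_holds(), min_idle) is hypothetical, and min_idle only stands
in for the idle minimum mentioned above. It assumes last_active <= now and
that no entry is marked dying on entry. The cull walks tail->head, stops at
the first non-expired worker, and never goes below the floor, so the property
"pending work implies a live idle worker" can be checked after each step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel symbols. */
struct model_worker {
	unsigned long last_active;	/* time the worker last went idle */
	bool dying;			/* set once the cull has picked it */
};

/*
 * idle[] is ordered head -> tail, so idle[n - 1] is the oldest worker.
 * Walk tail -> head, stop at the first non-expired worker, and never
 * drop below min_idle live workers. Returns the number reaped.
 */
static size_t model_cull(struct model_worker *idle, size_t n,
			 unsigned long now, unsigned long timeout,
			 size_t min_idle)
{
	size_t live = n, reaped = 0;

	while (live > min_idle) {
		struct model_worker *w = &idle[live - 1];

		if (now - w->last_active < timeout)
			break;		/* everything closer to head is newer */
		w->dying = true;
		live--;
		reaped++;
	}
	return reaped;
}

/* The property under discussion: pending work implies a live idle worker. */
static bool model_invariant_holds(size_t nr_pending, size_t nr_live_idle)
{
	return nr_pending == 0 || nr_live_idle > 0;
}
```

With min_idle >= 1 this model cannot violate the property however small the
timeout is, which is why a stranded item points at a state transition the
model leaves out rather than at the timestamps.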

Can you instrument the code with the lowered threshold and record the
sequence of events? If we record the sequence of work item and worker state
transitions, it should tell us what's broken. We shouldn't need to protect
all kicked workers to fix this.
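
One possible shape for such a record, sketched in user space. The event
names, struct model_log, model_log_event() and model_find_kicked_cull() are
all hypothetical; in the kernel the recording side would more likely be
trace_printk() or tracepoints at the corresponding transitions. The sketch
keeps a small ring of sequence-numbered events and scans it for a worker that
was culled after being kicked but before it left idle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds; the real set depends on what is instrumented. */
enum model_event {
	EV_WORK_QUEUED,
	EV_WORKER_KICKED,
	EV_WORKER_LEFT_IDLE,
	EV_WORKER_CULLED,
	EV_WORK_STARTED,
};

struct model_record {
	unsigned long seq;	/* global order of the event */
	enum model_event ev;
	int id;			/* work item or worker id, depending on ev */
};

#define MODEL_LOG_SIZE 64

struct model_log {
	struct model_record rec[MODEL_LOG_SIZE];
	unsigned long next_seq;
};

/* Append one event, overwriting the oldest record once the ring is full. */
static void model_log_event(struct model_log *log, enum model_event ev, int id)
{
	struct model_record *r = &log->rec[log->next_seq % MODEL_LOG_SIZE];

	r->seq = log->next_seq++;
	r->ev = ev;
	r->id = id;
}

/*
 * Return the sequence number of the first record where worker_id is
 * culled while still kicked (kicked, not yet left idle), or -1 if none.
 */
static long model_find_kicked_cull(const struct model_log *log, int worker_id)
{
	unsigned long start = log->next_seq > MODEL_LOG_SIZE ?
			      log->next_seq - MODEL_LOG_SIZE : 0;
	int kicked = 0;

	for (unsigned long s = start; s < log->next_seq; s++) {
		const struct model_record *r = &log->rec[s % MODEL_LOG_SIZE];

		if (r->id != worker_id)
			continue;
		if (r->ev == EV_WORKER_KICKED)
			kicked = 1;
		else if (r->ev == EV_WORKER_LEFT_IDLE)
			kicked = 0;
		else if (r->ev == EV_WORKER_CULLED && kicked)
			return (long)r->seq;
	}
	return -1;
}
```

A hit from the scan gives the point in the sequence to read backwards from.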

Thanks.

--
tejun