Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

From: Tejun Heo
Date: Mon Jul 30 2018 - 15:14:28 EST


Hello, Michal.

On Mon, Jul 30, 2018 at 08:51:10PM +0200, Michal Hocko wrote:
> > Yeah, workqueue can choke on things like that and kthread indefinitely
> > busy looping doesn't do anybody any good.
>
> Yeah, I do agree. But this is much easier said than done ;) Sure
> we have that hack that does sleep rather than cond_resched in the
> page allocator. We can and will "fix" it to be unconditional in the
> should_reclaim_retry [1] but this whole thing is really subtle. It just
> takes one misbehaving worker and something which is really important to
> run will get stuck.

Oh yeah, I'm not saying the current behavior is ideal or anything, but
in the many years since the behavior was put in, it has become a
problem only a couple of times, and all cases were rather easy and
obvious fixes on the wq user side. It shouldn't be difficult to add a timer
mechanism on top. We might be able to simply extend the hang
detection mechanism to kick off all pending rescuers after detecting a
wq stall. I'm wary about making it a part of normal operation
(i.e. a silent timeout). Per-cpu kworkers really shouldn't busy loop for
an extended period of time.
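
To make the "kick rescuers from the hang detector" idea concrete,
here is a rough userspace toy model, not kernel code. Every name in it
(toy_wq, toy_pool, toy_watchdog_check, the tick-based threshold) is
made up for illustration and does not correspond to the actual
workqueue implementation. The only point is the ordering: the stall is
detected and reported loudly first, and rescuers are kicked as a
consequence, so it never becomes a silent timeout in the normal path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model only: none of these types or functions are kernel APIs. */
struct toy_wq {
	bool has_rescuer;	/* this workqueue owns a rescuer thread */
	bool has_pending;	/* work is queued but not making progress */
	bool rescuer_kicked;	/* the rescuer has already been woken */
};

struct toy_pool {
	unsigned long last_progress;	/* tick of the pool's last forward progress */
	struct toy_wq *wqs;		/* workqueues served by this pool */
	size_t nr_wqs;
};

/*
 * Hypothetical watchdog step. If the pool has made no progress for at
 * least 'thresh' ticks, report the stall and kick every rescuer that
 * has pending work and has not been kicked yet.
 * Returns the number of rescuers kicked by this call.
 */
static size_t toy_watchdog_check(struct toy_pool *pool, unsigned long now,
				 unsigned long thresh)
{
	size_t kicked = 0;
	size_t i;

	assert(pool != NULL);

	if (now - pool->last_progress < thresh)
		return 0;	/* no stall, normal operation is untouched */

	/* The stall is reported before anything is kicked: not silent. */
	fprintf(stderr, "toy: pool stalled for %lu ticks\n",
		now - pool->last_progress);

	for (i = 0; i < pool->nr_wqs; i++) {
		struct toy_wq *wq = &pool->wqs[i];

		if (wq->has_rescuer && wq->has_pending && !wq->rescuer_kicked) {
			wq->rescuer_kicked = true;
			kicked++;
		}
	}
	return kicked;
}
```

In this model a workqueue without a rescuer is left alone even when
the pool is stalled, and a rescuer that was already kicked is not
kicked again on the next watchdog pass.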

Thanks.

--
tejun