Re: [PATCH] fix-flush_workqueue-vs-cpu_dead-race-update

From: Oleg Nesterov
Date: Sun Jan 07 2007 - 07:55:47 EST


On 01/07, Srivatsa Vaddagiri wrote:
>
> On Sat, Jan 06, 2007 at 08:34:16PM +0300, Oleg Nesterov wrote:
> > I suspect this can't help either.
> >
> > The problem is that flush_workqueue() may be called while a cpu hotplug event
> > is in progress and CPU_DEAD waits for kthread_stop(), so we have the same
> > deadlock if work->func() does flush_workqueue(). This means that Andrew's change
> > to use preempt_disable() is good and needed anyway.
>
> Well... a lock_cpu_hotplug() in run_workqueue() and support for recursive
> calls to lock_cpu_hotplug() by the same thread will avoid the problem
> you mention.

Srivatsa, I'm completely new to cpu-hotplug, so please correct me if I'm
wrong (in fact, I _hope_ I am wrong), but as I see it, the hotplug/workqueue
interaction is broken by design; it can't be fixed just by changing the locking.

Once again: the CPU dies, CPU_DEAD calls kthread_stop() and sleeps until
cwq->thread exits. To exit, this thread must at least complete the
currently running work->func().

work->func() calls flush_workqueue(WQ), which does lock_cpu_hotplug() or
_whatever_. Now the question: does it block?

If YES:
This is what the stable tree does: deadlock.

If NO:
This is what we have with Andrew's s/mutex_lock/preempt_disable/
patch: either a race or a deadlock, take your pick.

Suppose that WQ has pending works on that dead CPU. Note that
at this point the CPU is no longer present in cpu_online_map.
This means that, without other changes, we have lost:

- flush_workqueue(WQ) can't return until CPU_DEAD transfers
these works to some other CPU in cpu_online_map.

- CPU_DEAD can't do take_over_work() until flush_workqueue()
returns.

Andrew, Ingo: this also means that the freezer can't solve this particular
problem either (if I am right).

Thoughts?

Oleg.

