Re: [RFC/PATCH] debug workqueue deadlocks with lockdep

From: Johannes Berg
Date: Mon Jul 02 2007 - 04:37:42 EST


On Sat, 2007-06-30 at 15:46 +0400, Oleg Nesterov wrote:
> On 06/30, Ingo Molnar wrote:
> >
> > On Thu, 2007-06-28 at 19:33 +0200, Johannes Berg wrote:
> > > No, that's not right either, but Arjan just helped me a bit with how
> > > lockdep works and I think I have the right idea now. Ignore this for
> > > now, I'll send a new patch in a few days.
> >
> > ok. But in general, this is a very nice idea!
> >
> > i've Cc:-ed Oleg. Oleg, what do you think? I think we should keep all
> > the workqueue APIs specified in a form that makes them lockdep coverable
> > like Johannes did. This debug mechanism could have helped with the
> > recent DVB lockup that Thomas Sattler reported.
>
> I think this idea is great!

:)

> Johannes, could you change wait_on_work() as well? Most users of
> flush_workqueue() should be converted to use it.

Sure. The case I had used flush_workqueue() but I'll look into
wait_on_work() and maybe converting the place where I had this.

> > @@ -342,6 +351,9 @@ static int flush_cpu_workqueue(struct cp
> > } else {
> > struct wq_barrier barr;
> >
> > + lock_acquire(&cwq->wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_);
> > + lock_release(&cwq->wq->lockdep_map, 0, _THIS_IP_);
> > +
> > active = 0;
> > spin_lock_irq(&cwq->lock);
>
> I am not sure why you skip "if (cwq->thread == current)" case, it can
> deadlock in the same way.

I wasn't sure what would happen with recursion.

> But, perhaps we should not change flush_cpu_workqueue(). If we detect the
> deadlock, we will have num_online_cpus() reports, yes?

Not sure what you're thinking of here. I initially had it in
flush_workqueue() but then put it into flush_cpu_workqueue(), but I have
to admit that I already forgot why.

> And,
>
> > if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) {
> > @@ -376,6 +388,8 @@ void fastcall flush_workqueue(struct wor
> > int cpu;
> >
> > might_sleep();
> > + lock_acquire(&wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_);
> > + lock_release(&wq->lockdep_map, 0, _THIS_IP_);
>
> one of the 2 callers was already modified. Perhaps it is better to add
> lock_acquire() into the second caller, cleanup_workqueue_thread(), but
> skip flush_cpu_workqueue() ?

I'll have to look at the code.

The problem I discussed with Arjan is that with this patch all
workqueues are in the same class which is wrong, so I'll have to modify
the workqueue creation API by introducing some macros for the LOCKDEP
case that pass the current code spot as the workqueue class. I'll try to
do that later today.

johannes

Attachment: signature.asc
Description: This is a digitally signed message part