Re: [RFC PATCH] cifs: Fix possible deadlock with cifs and work queues

From: Jeff Layton
Date: Thu Mar 20 2014 - 17:03:26 EST


On Thu, 20 Mar 2014 16:57:03 -0400
Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

> On Thu, 20 Mar 2014 15:28:33 -0400
> Jeffrey Layton <jlayton@xxxxxxxxxx> wrote:
>
>
> > Nice analysis! I think eventually we'll need to overhaul this code not
>
> Note, Ulrich Obergfell helped a bit in the initial analysis. He found
> from a customer core dump that the kworker thread was blocked on the
> cinode->lock_sem, and the reader was blocked as well. That was enough
for me to find where the problem lay.
>

Kudos to Uli, then ;)

> > to use rw semaphores, but that's going to take some redesign. (Wonder
> > if we could change it to use seqlocks or something?)
> >
> > Out of curiosity, does this eventually time out and unwedge itself?
> > Usually when the server doesn't get a response to an oplock break in
> > around a minute or so it gives up and allows the thing that caused the
> > oplock break to proceed anyway. Not great for performance, but it ought
> > to eventually make progress because of that.
>
> No, I believe it's hard locked. Nothing is going to wake up the oplock
> break if it is blocked on a down_read(). Only the release of the rwsem
> will do that. It's a subtle consequence of how the kworker threads are run.
>

Eventually the server should just allow the read to complete even if
the client doesn't respond to the oplock break. It has to since clients
can suddenly drop off the net while holding an oplock. That should
allow everything to unwedge eventually (though it may take a while).

If that's not happening then I'd be curious as to why...

> >
> > In any case, this looks like a reasonable fix for now, but I suspect you
> > can hit similar problems in the write codepath too. What may be best is
> > turn this around and queue the oplock break to the new workqueue
> > instead of the read completion job.
>
> Or perhaps give both the read and write their own workqueues? We have
> to look at all the work queue handlers, and be careful about any users
> that take the lock_sem, and separate them out.
>

Yeah, I haven't looked closely yet but I'm fairly sure that you could
hit the same situation in the write codepath as well. Whether adding
more workqueues will really help, I'm not sure yet...
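The wedge Steven describes — a completion work item stuck behind a blocked work item on the same workqueue — can be sketched with a toy model. This is not cifs code: the names, the single-worker queues, and the use of an event in place of `cinode->lock_sem` are all illustrative assumptions; it only shows why sharing one queue can deadlock and why giving the read completion its own queue lets things drain.

```python
# Toy model (not kernel code): one single-threaded executor stands in for a
# workqueue worker, and an Event stands in for the read completing under
# lock_sem. All names here are hypothetical.
import threading
from concurrent.futures import ThreadPoolExecutor

def run_scenario(shared_queue):
    """Return True if the read completion gets to run in time."""
    read_done = threading.Event()

    # One worker thread models a kworker servicing a workqueue.
    oplock_wq = ThreadPoolExecutor(max_workers=1)
    read_wq = oplock_wq if shared_queue else ThreadPoolExecutor(max_workers=1)

    def oplock_break():
        # Models the oplock-break work blocking (e.g. on down_read of
        # lock_sem) until the in-flight read finishes. The timeout only
        # exists so this demo can't hang forever.
        read_done.wait(timeout=1.0)

    def read_completion():
        read_done.set()

    oplock_wq.submit(oplock_break)   # queued first; occupies the worker
    read_wq.submit(read_completion)  # stuck behind it if the queue is shared
    completed = read_done.wait(timeout=0.5)

    oplock_wq.shutdown(wait=True)
    if not shared_queue:
        read_wq.shutdown(wait=True)
    return completed

print(run_scenario(shared_queue=True))   # wedges: completion can't run in time
print(run_scenario(shared_queue=False))  # separate queue: completion runs
```

With a shared queue the completion never runs while the worker is blocked, which is the shape of the proposed fix: move one of the two work items onto its own workqueue so neither can starve the other.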

--
Jeff Layton <jlayton@xxxxxxxxxx>