Re: [PATCH 38/40] cifs: use workqueue instead of slow-work

From: Jeff Layton
Date: Tue Jan 19 2010 - 19:59:51 EST


On Wed, 20 Jan 2010 09:15:14 +0900
Tejun Heo <tj@xxxxxxxxxx> wrote:

> Hello,
>
> On 01/19/2010 09:20 PM, Jeff Layton wrote:
> >> @@ -584,13 +583,13 @@ is_valid_oplock_break(struct smb_hdr *buf, struct TCP_Server_Info *srv)
> >> pCifsInode->clientCanCacheAll = false;
> >> if (pSMB->OplockLevel == 0)
> >> pCifsInode->clientCanCacheRead = false;
> >> - rc = slow_work_enqueue(&netfile->oplock_break);
> >> - if (rc) {
> >> - cERROR(1, ("failed to enqueue oplock "
> >> - "break: %d\n", rc));
> >> - } else {
> >> - netfile->oplock_break_cancelled = false;
> >> - }
> >> +
> >> + cifs_oplock_break_get(netfile);
> >> + if (!queue_work(system_single_wq,
> >> + &netfile->oplock_break))
> >> + cifs_oplock_break_put(netfile);
> >> + netfile->oplock_break_cancelled = false;
> >> +
> >> read_unlock(&GlobalSMBSeslock);
> >> read_unlock(&cifs_tcp_ses_lock);
> >> return true;
> >
> > This block of code looks problematic. This code is run by the
> > cifs_demultiplex_thread (cifsd). We can't do an oplock_break_put in
> > this context, since it might trigger a blocking SMB and cause a
> > deadlock.
>
> Okay, thanks for pointing it out.
>
> > A while back, I backported this code to earlier kernels and used a
> > standard workqueue there. What I did there was to only do the "get" if
> > the queue_work succeeded, and then had the queued work take and
> > immediately drop the GlobalSMBSeslock first thing. Yes, it's ugly, but
> > it prevented the possible deadlock and didn't require adding anything
> > like completion vars to the struct.
>
> Hmmm... Why is locking GlobalSMBSeslock necessary?
> cifs_oplock_break_get() can never fail and it seems that
> is_valid_oplock_break() should be holding valid reference by the time
> it enqueues the work, so wouldn't the following be sufficient?
>
> if (queue_work(system_single_wq, &netfile->oplock_break))
> cifs_oplock_break_get(netfile);
>

I guess I wasn't sure I could count on that. It seems unlikely that the
work would run before you did the "get", but unlikely races are even
harder to troubleshoot when they do get hit.

FWIW, it's not a terribly hot codepath, so taking and dropping the
spinlock shouldn't be too bad for performance. If you're certain that
we don't need to worry about it though, then maybe we can just skip
doing that.

> > Also, this change seems to have changed the logic a bit. The
> > oplock_break_cancelled flag is being set to false unconditionally, and
> > the printk was dropped. Not a big deal on the last part, but we can't
> > really do much with errors in this codepath so it might be helpful to
> > have some indication that there are problems here.
>
> The thing is that slow_work_enqueue() can only fail if getting a
> reference fails. In cifs' case, it always succeeds so there's no
> failure case to handle there.
>

Ok, but here we're changing this to queue_work. Is that also guaranteed
to succeed here? If so, then dropping the printk is fine. If not, then
I think we should keep it in.

It's been a while since I overhauled this code, so I'll need to look
again at the semantics for the oplock_break_cancelled flag. It may be
ok to just set it unconditionally like this, but I'll need to check
and be sure.

--
Jeff Layton <jlayton@xxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/