Re: [PATCH] Revert "workqueue: re-add lockdep dependencies for flushing"

From: Johannes Berg
Date: Tue Oct 23 2018 - 15:27:01 EST


On Mon, 2018-10-22 at 17:43 -0700, Sagi Grimberg wrote:
> > > I must also say that I'm disappointed you'd try to do things this way.
> > > I'd be (have been?) willing to actually help you understand the problem
> > > and add the annotations, but rather than answer my question ("where do I
> > > find the right git tree"!) you just send a revert patch.
> >
> > Sorry that I had not yet provided that information. You should have
> > received this information through another e-mail thread. See also
> > http://lists.infradead.org/pipermail/linux-nvme/2018-October/020493.html.
> >
> > > To do that, you have to understand what recursion is valid (I'm guessing
> > > there's some sort of layering involved), and I'm far from understanding
> > > anything about the code that triggered this report.
> >
> > I don't think there is any kind of recursion involved in the NVMe code
> > that triggered the lockdep complaint. Sagi, please correct me if I got this
> > wrong.
>
> I commented on the original thread. I'm not sure it qualifies as a
> recursion, but in that use-case, when priv->handler_mutex is taken
> it is possible that other priv->handler_mutex instances are taken but
> are guaranteed not to belong to that priv...

I also commented over there regarding this specific problem.

I think my wording was inaccurate perhaps. It's not so much direct
recursion, but a dependency chain connecting the handler_mutex class
back to itself. This is how lockdep finds problems, it doesn't consider
actual *instances* but only *classes*.

As Sagi mentioned on the other thread, in this specific instance you
happen to know out-of-band that this is safe, (somehow) the code ensures
that you can't actually recursively connect back to yourself. Perhaps
there's some kind of layering involved, I don't know. (*)

As I said over there, this type of thing can be solved with
mutex_lock_nested(), though it's possible that in this case it should
really be flush_workqueue_nested() or flush_work_nested() or so.

johannes

(*) note that just checking "obj != self" might not be enough, depending
on how you pick candidates for "obj" you might recurse down another
level and end up back at yourself, just two levels down, and actually
have a dependency chain leading back to yourself.