Re: [PATCH 1/2] driver core: Introduce device_link_wait_removal()

From: Saravana Kannan
Date: Thu Feb 29 2024 - 18:27:13 EST


On Fri, Feb 23, 2024 at 2:41 AM Nuno Sá <noname.nuno@xxxxxxxxx> wrote:
>
> On Fri, 2024-02-23 at 10:11 +0100, Herve Codina wrote:
> > Hi Saravana,
> >
> > On Tue, 20 Feb 2024 16:31:13 -0800
> > Saravana Kannan <saravanak@xxxxxxxxxx> wrote:
> >
> > ...
> >
> > > > +void device_link_wait_removal(void)
> > > > +{
> > > > + /*
> > > > + * devlink removal jobs are queued in the dedicated work queue.
> > > > + * To be sure that all removal jobs are terminated, ensure that
> > > > any
> > > > + * scheduled work has run to completion.
> > > > + */
> > > > + drain_workqueue(fw_devlink_wq);
> > >
> > > Is there a reason this needs to be drain_workqueu() instead of
> > > flush_workqueue(). Drain is a stronger guarantee than we need in this
> > > case. All we are trying to make sure is that all the device link
> > > remove work queued so far have completed.
> >
> > I used drain_workqueue() because drain_workqueue() allows for jobs already
> > present in a workqueue to re-queue a job and drain_workqueue() will wait
> > also for this new job completion.
> >
> > I think flush_workqueue() doesn't wait for this chain queueing.
> >
> > In our case, my understanding was that device_link_release_fn() calls
> > put_device() for the consumer and the supplier.
> > If refcounts reaches zero, devlink_dev_release() can be called again
> > and re-queue a job.
> >
>
> Looks sensible. The only doubt (that Saravana mays know better) is that I'm not
> sure put_device() on a supplier or consumer can actually lead to
> devlink_dev_release(). AFAIU, a consumer or a supplier should not be a device
> from the devlink class. Hence, looking at device_release(), I'm not sure it can
> happen unless for some odd reason someone is messing with devlinks in .remove()
> or .type->remove().

The case we are trying to fix here involves a supplier or a consumer
device (say Device-A) being device_del(). When that happens, all the
device links to/from the device are deleted by a call to
device_links_purge() since a device link can't exist without both the
supplier and consumer existing.

The problem you were hitting is that the device link deletion code
does the put_device(Device-A) in a workqueue. You change is to make
sure to wait until that has completed. To do that, you only need to
wait for the device link deletion work (already queued before
returning from device_del()) to finish. You don't need to wait for
anything more.

I read up on drain_workqueue() before I made my comments. The point I
was trying to make is that there could be some unrelated device link
deletions that you don't need to wait on.

But taking a closer look[1], it looks like drain_workqueue() might
actually cause bugs because while a workqueue is being drained, if
another unrelated device link deletion is trying to queue work, that
will get ignored.

Reply to rest of the emails in this thread here:

Nuno,

Sorry if I messed up who sent the first patch, but I did dig back to
your v1. But I could be wrong.

If devlink_dev_release() could have done the work synchronously, we'd
have done it in the first place. It's actually a bug because
devlink_dev_release() gets called in atomic context but the
put_device() on the supplier/consumer can do some sleeping work.

-Saravana

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/workqueue.c#n1727