Re: [PATCH] drivers/base: use a worker for sysfs unbind
From: Daniel Vetter
Date: Wed Dec 12 2018 - 07:40:25 EST
On Wed, Dec 12, 2018 at 12:19 PM Greg Kroah-Hartman
<gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Dec 12, 2018 at 12:08:40PM +0100, Daniel Vetter wrote:
> > On Mon, Dec 10, 2018 at 11:20:58AM +0100, Daniel Vetter wrote:
> > > On Mon, Dec 10, 2018 at 11:18:32AM +0100, Daniel Vetter wrote:
> > > > On Mon, Dec 10, 2018 at 11:06:34AM +0100, Greg Kroah-Hartman wrote:
> > > > > On Mon, Dec 10, 2018 at 09:46:53AM +0100, Daniel Vetter wrote:
> > > > > > Drivers might want to remove some sysfs files, which needs the same
> > > > > > locks and ends up angering lockdep. Relevant snippet of the stack
> > > > > > trace:
> > > > > >
> > > > > > kernfs_remove_by_name_ns+0x3b/0x80
> > > > > > bus_remove_driver+0x92/0xa0
> > > > > > acpi_video_unregister+0x24/0x40
> > > > > > i915_driver_unload+0x42/0x130 [i915]
> > > > > > i915_pci_remove+0x19/0x30 [i915]
> > > > > > pci_device_remove+0x36/0xb0
> > > > > > device_release_driver_internal+0x185/0x250
> > > > > > unbind_store+0xaf/0x180
> > > > > > kernfs_fop_write+0x104/0x190
> > > > > >
> > > > > > I've stumbled over this because some new patches by Ram connect the
> > > > > > snd-hda-intel unload (where we do use sysfs unbind) with the locking
> > > > > > chains in the i915 unload code (but without creating a new loop),
> > > > > > which upset our CI. But the bug is already there and can be easily
> > > > > > reproduced by unbinding i915 directly.
> > > > >
> > > > > This is odd, why wouldn't any driver hit this issue? And why now since
> > > > > you say this is triggerable today?
> > > >
> > > > The above backtrace is triggered by unbinding i915 on current upstream
> > > > kernels. Note: Will crash later on rather badly in the
> > > > fbdev/fbcon/vtconsole hell, but that's a separate issue (which can be worked
> > > > around by first unbinding fbcon manually through sysfs).
> > > >
> > > > > I know scsi was doing some strange things like trying to remove the
> > > > > device itself from a sysfs callback on the device, which requires it to
> > > > > just call a different kobject function created just for that type of
> > > > > thing. Would that also make sense to do here instead of your workqueue?
> > > >
> > > > Note how we blow up on unregistering sw device instances supported by i915
> > > > in entirely different subsystems. I guess most drivers just have sysfs
> > > > files for their own stuff, where this is done as you describe. The problem
> > > > is that there's an awful lot of unrelated stuff hanging off i915.
> > > >
> > > > Or maybe acpi_video is busted, and should be using a different function.
> > > > You haven't said which one, and I have no idea which one it is ...
> > > >
> > > > And in case the context wasn't clear: This is unbinding the i915 pci
> > > > driver which triggers the above lockdep splat recursion.
> > >
> > > btw another option for "fixing" this would be to annotate the mutex_lock
> > > in kernfs_remove_by_name_ns as recursive. Which just shuts up lockdep (and
> > > might hide some real bugs), but would get the job done since there's not
> > > actually a deadlock here. Just lockdep being annoyed.
> >
> > So what's the pick? I can do the typing, but I don't understand all the
> > driver core interactions to know what we should be doing here best.
>
> Sorry for the delay.
>
> Look at sdev_store_delete() in drivers/scsi/scsi_sysfs.c and see if the
> logic there makes sense to do here instead.

This looks interesting, but it doesn't solve the problem. The issue is
_not_ that we remove the same sysfs file as the one we're writing
into. It's that we're removing an entirely unrelated sysfs file, which
will not cause a deadlock per se, but triggers lockdep because it's in
the same locking class. Note how the locking recursion is within one
callchain: this would deadlock right away if it were the same file, but
unloading happily continues.
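
To make the distinction above concrete, here is a minimal user-space
sketch. It is a toy model, not the kernel's lockdep and not kernfs code:
all names (toy_lock, toy_task, toy_acquire, toy_release) are hypothetical,
and which kernfs lock is actually involved is not modelled. It only shows
that a checker keyed on lock classes complains about nested acquisition of
two different locks of the same class, even though only re-taking the very
same lock would really hang.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not a kernel API. */
#define TOY_MAX_HELD 8

struct toy_lock {
	int class_id;	/* locks set up from the same site share a class */
};

struct toy_task {
	const struct toy_lock *held[TOY_MAX_HELD];	/* locks held, in order */
	size_t depth;
};

enum toy_result {
	TOY_OK,			/* nothing to report */
	TOY_CLASS_RECURSION,	/* same class, different lock: report only */
	TOY_SELF_DEADLOCK,	/* same lock again: would really hang */
};

/* Check the new lock against everything already held, then take it. */
static enum toy_result toy_acquire(struct toy_task *t, struct toy_lock *l)
{
	enum toy_result res = TOY_OK;
	size_t i;

	for (i = 0; i < t->depth; i++) {
		if (t->held[i] == l)
			return TOY_SELF_DEADLOCK;	/* refuse, lock not taken */
		if (t->held[i]->class_id == l->class_id)
			res = TOY_CLASS_RECURSION;	/* complain, but proceed */
	}

	assert(t->depth < TOY_MAX_HELD);
	t->held[t->depth++] = l;
	return res;
}

/* Locks must be released in reverse order of acquisition. */
static void toy_release(struct toy_task *t, struct toy_lock *l)
{
	assert(t->depth > 0 && t->held[t->depth - 1] == l);
	t->depth--;
}
```

In this model, taking a lock for the file being written and then a second
lock of the same class for an unrelated file yields TOY_CLASS_RECURSION
and execution continues, which corresponds to the splat-but-no-hang
behaviour described above. Only asking for the first lock again yields
TOY_SELF_DEADLOCK.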

> It still seems odd that removing a sysfs file by writing to a sysfs file
> at the same level really invokes lockdep as I would have thought that
> this path is well-tested by now.

IIRC this has been around forever for gpu drivers. We just never bothered
to fix it, because there are much bigger issues in hotunplug for gpu
drivers. The only reason we use unbind in CI is that it's the simplest
way to get userspace off the snd-hda-intel driver (which needs to be
unloaded before i915, if you want to unload that).
-Daniel
>
> thanks,
>
> greg k-h
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch