Re: [PATCH v5 1/8] drivers: kunit: Generic helpers for test device creation

From: Maxime Ripard
Date: Wed Mar 29 2023 - 15:45:03 EST


Hi David,

On Sat, Mar 25, 2023 at 01:40:01PM +0800, David Gow wrote:
> On Fri, 24 Mar 2023 at 20:32, Maxime Ripard <maxime@xxxxxxxxxx> wrote:
> >
> > On Fri, Mar 24, 2023 at 08:11:52AM +0200, Matti Vaittinen wrote:
> > > On 3/23/23 18:36, Maxime Ripard wrote:
> > > > On Thu, Mar 23, 2023 at 03:02:03PM +0200, Matti Vaittinen wrote:
> > > > > On 3/23/23 14:29, Maxime Ripard wrote:
> > > > > > On Thu, Mar 23, 2023 at 02:16:52PM +0200, Matti Vaittinen wrote:
> > > > > >
> > > > > > This is the description of what was happening:
> > > > > > https://lore.kernel.org/dri-devel/20221117165311.vovrc7usy4efiytl@houat/
> > > > >
> > > > > Thanks Maxime. Do I read this correcty. The devm_ unwinding not being done
> > > > > when root_device_register() is used is not because root_device_unregister()
> > > > > would not trigger the unwinding - but rather because DRM code on top of this
> > > > > device keeps the refcount increased?
> > > >
> > > > There's a difference of behaviour between a root_device and any device
> > > > with a bus: the root_device will only release the devm resources when
> > > > it's freed (in device_release), but a bus device will also do it in
> > > > device_del (through bus_remove_device() -> device_release_driver() ->
> > > > device_release_driver_internal() -> __device_release_driver() ->
> > > > device_unbind_cleanup(), which are skipped (in multiple places) if
> > > > there's no bus and no driver attached to the device).
> > > >
> > > > It does affect DRM, but I'm pretty sure it will affect any framework
> > > > that deals with device hotplugging by deferring the framework structure
> > > > until the last (userspace) user closes its file descriptor. So I'd
> > > > assume that v4l2 and cec at least are also affected, and most likely
> > > > others.
> > >
> > > Thanks for the explanation and patience :)
> > >
> > > >
> > > > > If this is the case, then it sounds like a DRM specific issue to me.
> > > >
> > > > I mean, I guess. One could also argue that it's because IIO doesn't
> > > > properly deal with hotplugging.
> > >
> > > I must say I haven't been testing the IIO registration API. I've only tested
> > > the helper API which is not backed up by any "IIO device". (This is fine for
> > > the helper because it must by design be cleaned-up only after the
> > > IIO-deregistration).
> > >
> > > After your explanation here, I am not convinced IIO wouldn't see the same
> > > issue if I was testing the devm_iio_device_alloc() & co.
> >
> > It depends really. The issue DRM is trying to solve is that, when a
> > device is gone, some application might still have an open FD and could
> > still poke into the kernel, while all the resources would have been
> > free'd if it was using devm.
> >
> > So everything is kept around until the last fd is closed, so you still
> > have a reference to the device (even though it's been removed from its
> > bus) until that time.
> >
> > It could be possible that IIO just doesn't handle that case at all. I
> > guess most of the devices aren't hotpluggable, and there's not much to
> > interact with from a userspace PoV iirc, so it might be why.
> >
> > > > I'm not sure how that helps. Those are
> > > > common helpers which should accommodate every framework,
> > >
> > > Ok. Fair enough. Besides, if the root-device was sufficient - then I would
> > > actually not see the need for a helper. People could in that case directly
> > > use the root_device_register(). So, if helpers are provided they should be
> > > backed up by a device with a bus then.
> > >
> > > > and your second
> > > > patch breaks the kunit tests for DRM anyway.
> > >
> > > Oh, I must have made an error there. It was supposed to be just a
> > > refactoring with no functional changes. Sorry about that. Anyways, that
> > > patch can be forgotten as Greg opposes using the platform devices in generic
> > > helpers.
> > >
> > > > > Whether it is a feature or bug is beyond my knowledge. Still, I would
> > > > > not say using the root_device_[un]register() in generic code is not
> > > > > feasible - unless all other subsytems have similar refcount handling.
> > > > >
> > > > > Sure thing using root_device_register() root_device_unregister() in DRM does
> > > > > not work as such. This, however, does not mean the generic kunit helpers
> > > > > should use platform_devices to force unwinding?
> > > >
> > > > platform_devices were a quick way to get a device that would have a bus
> > > > and a driver bound to fall into the right patch above. We probably
> > > > shouldn't use platform_devices and a kunit_device sounds like the best
> > > > idea, but the test linked in the original mail I pointed you to should
> > > > work with whatever we come up with. It works with multiple (platform,
> > > > PCI, USB, etc) buses, so the mock we create should behave like their
> > > > real world equivalents.
> > >
> > > Thanks for the patience and the explanation. Now I understand a generic test
> > > device needs to sit on a bus.
> > >
> > > As I said, in my very specific IIO related test the test device does not
> > > need a bus. Hence I'll drop the 'generic helpers' from this series.
> >
> > So, I went around and created a bunch of kunit tests that shows the
> > problem without DRM being involved at all.
> >
> > It does three things:
> >
> > - It registers a device, attaches a devm action, unregisters the device
> > and then checks that the action has ran.
> >
> > - It registers a device, gets a reference to it, attaches a devm
> > action, puts back the reference, unregisters the device and then
> > checks that the action has ran.
> >
> > - It registers a device, gets a reference to it, attaches a devm action
> > that will put back the reference, unregisters the device and then
> > checks that the action has ran.
> >
> > And in three cases: first with a root_device, then platform_device, then
> > a platform_device that has been bound to a driver.
> >
> > Once you've applied that patch, you can run it using:
> >
> > ./tools/testing/kunit/kunit.py run --kunitconfig=drivers/base/test/ devm-inconsistencies
> >
> > You'll see that only the last case passes all the tests, even though the
> > code itself is exactly the same.
> >
>
> This illustrates the problem very nicely, thanks.
>
> I played around a bit with this, and I'm definitely leaning towards
> this being a bug, rather than intentional behaviour. There's actually
> an explicit call to devres_release_all() in device_release() which
> seems to suggest that this should work:
> /*
> * Some platform devices are driven without driver attached
> * and managed resources may have been acquired. Make sure
> * all resources are released.
> *
> * Drivers still can add resources into device after device
> * is deleted but alive, so release devres here to avoid
> * possible memory leak.
> */
>
> My "solution" is just to call devres_release_all() in device_del() as
> well, which fixes your tests (and the drm-test-managed one when ported
> to use root_device_register() or my kunit_device_register() API[1]).
>
> --8<--
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 6878dfcbf0d6..adfec6185014 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -3778,6 +3778,17 @@ void device_del(struct device *dev)
> device_platform_notify_remove(dev);
> device_links_purge(dev);
>
> + /*
> + * If a device does not have a driver attached, we need to clean up any
> + * managed resources. We do this in device_release(), but it's never
> + * called (and we leak the device) if a managed resource holds a
> + * reference to the device. So release all managed resources here,
> + * like we do in driver_detach(). We still need to do so again in
> + * device_release() in case someone adds a new resource after this
> + * point, though.
> + */
> + devres_release_all(dev);
> +
> bus_notify(dev, BUS_NOTIFY_REMOVED_DEVICE);
> kobject_uevent(&dev->kobj, KOBJ_REMOVE);
> glue_dir = get_glue_dir(dev);
>
> -->8--
>
> It doesn't _seem_ to break anything else, and I've managed to convince
> myself that it's probably the correct fix.

Yeah, as an outsider, I came to the same conclusion last time...

> (Albeit, as someone with a limited knowledge of this part of the code,
> who also hasn't had quite enough sleep, so take that with some salt.)

... but as someone that also have a limited knowledge of that part of
the code, I certainly wasn't sure it was the proper thing to do :)

> Still, I'd agree with Greg that it'd be great to have versions of your
> tests upstream before making any such radical changes.

I just submitted them.

Thanks!
Maxime