Re: [PATCH v3] driver core: fw_devlink: Stop trying to optimize cycle detection logic
From: Luca Ceresoli
Date: Fri Dec 06 2024 - 04:31:56 EST
Hi Saravana,
On Wed, 4 Dec 2024 12:48:26 +0100
Luca Ceresoli <luca.ceresoli@xxxxxxxxxxx> wrote:
> Hello Saravana,
>
> +Cc. DT maintainers, Hervé
>
> On Wed, 30 Oct 2024 10:10:07 -0700
> Saravana Kannan <saravanak@xxxxxxxxxx> wrote:
>
> > In attempting to optimize fw_devlink runtime, I introduced numerous cycle
> > detection bugs by foregoing cycle detection logic under specific
> > conditions. Each fix has further narrowed the conditions for optimization.
> >
> > It's time to give up on these optimization attempts and just run the cycle
> > detection logic every time fw_devlink tries to create a device link.
> >
> > The specific bug report that triggered this fix involved a supplier fwnode
> > that never gets a device created for it. Instead, the supplier fwnode is
> > represented by the device that corresponds to an ancestor fwnode.
> >
> > In this case, fw_devlink didn't do any cycle detection because the cycle
> > detection logic is only run when a device link is created between the
> > devices that correspond to the actual consumer and supplier fwnodes.
> >
> > With this change, fw_devlink will run cycle detection logic even when
> > creating SYNC_STATE_ONLY proxy device links from a device that is an
> > ancestor of a consumer fwnode.
> >
> > Reported-by: Tomi Valkeinen <tomi.valkeinen@xxxxxxxxxxxxxxxx>
> > Closes: https://lore.kernel.org/all/1a1ab663-d068-40fb-8c94-f0715403d276@xxxxxxxxxxxxxxxx/
> > Fixes: 6442d79d880c ("driver core: fw_devlink: Improve detection of overlapping cycles")
> > Tested-by: Tomi Valkeinen <tomi.valkeinen@xxxxxxxxxxxxxxxx>
> > Signed-off-by: Saravana Kannan <saravanak@xxxxxxxxxx>
>
> After rebasing my work for the hotplug connector driver using device
> tree overlays [0] on v6.13-rc1 I started getting these OF errors on
> overlay removal:
>
> OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /addon-connector/devices/panel-dsi-lvds
> OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /addon-connector/devices/backlight-addon
> OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /addon-connector/devices/battery-charger
> OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /addon-connector/devices/regulator-addon-5v0-sys
> OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /addon-connector/devices/regulator-addon-3v3-sys
>
> ...and many more. Exactly one per each device in the overlay 'devices'
> node, each implemented by a platform driver.
>
> Bisecting found this patch is triggering these error messages, which
> in fact disappear by reverting it.
>
> I looked at the differences in dmesg and /sys/class/devlink/ in the
> "good" and "bad" cases, and found almost no differences. The only
> relevant difference is in cycle detection for the panel node, which was
> expected, but nothing about all the other nodes like regulators.
>
> Enabling debug messages in core.c also does not show significant
> changes between the two cases, even though it's hard to be sure given
> the verbosity of the log and the reordering of messages.
>
> I suspect the new version of the cycle removal code is missing an
> of_node_get() somewhere, but that is not directly visible in the patch
> diff itself.

I collected some more info by adding a bit of logging for one of the
affected devices.

It looks like the of_node_get() and of_node_put() calls in the overlay
loading phase are the same, even though not completely in the same
order. So after overlay insertion we should have the same refcount with
and without your patch.

There is a difference on overlay removal, however: an of_node_put() call
is absent with 6.13-rc1 code (errors emitted), and becomes present by
just reverting your patch (the "good" case). Here's the stack trace of
this call:

Call trace:
 show_stack+0x20/0x38 (C)
 dump_stack_lvl+0x74/0x90
 dump_stack+0x18/0x28
 of_node_put+0x50/0x70
 platform_device_release+0x24/0x68
 device_release+0x3c/0xa0
 kobject_put+0xa4/0x118
 device_link_release_fn+0x60/0xd8
 process_one_work+0x158/0x3c0
 worker_thread+0x2d8/0x3e8
 kthread+0x118/0x128
 ret_from_fork+0x10/0x20

So for some reason, with your patch applied, device_link_release_fn()
no longer leads to this of_node_put() call.
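
To make the refcount accounting concrete, here is a toy userspace model
of the chain in the trace above. It is only a sketch, not kernel code:
the toy_* types and helpers and simulate_overlay_removal() are
hypothetical names, and the assumption that a device link pins the
device (so the final device put, and with it the node put, happens only
when the link is released) is my reading of the trace, not something
verified against the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for a DT node and a device; NOT the kernel structures. */
struct toy_node {
	int refcount;
};

struct toy_device {
	int refcount;
	struct toy_node *of_node;
};

static void toy_node_get(struct toy_node *n)
{
	n->refcount++;
}

static void toy_node_put(struct toy_node *n)
{
	assert(n->refcount > 0);
	n->refcount--;
}

static void toy_device_get(struct toy_device *d)
{
	d->refcount++;
}

/*
 * Dropping the last device reference runs the release step, which puts
 * the node (mirrors platform_device_release() -> of_node_put() in the
 * trace).
 */
static void toy_device_put(struct toy_device *d)
{
	assert(d->refcount > 0);
	if (--d->refcount == 0)
		toy_node_put(d->of_node);
}

/* Returns the node refcount seen when the overlay changeset is destroyed. */
int simulate_overlay_removal(bool link_drops_device_ref)
{
	struct toy_node node = { .refcount = 1 };	/* held by the changeset */
	struct toy_device dev = { .refcount = 1, .of_node = &node };

	toy_node_get(&node);	/* device creation takes a node reference */
	toy_device_get(&dev);	/* assumed: a device link pins the device */

	toy_device_put(&dev);	/* overlay removal unregisters the device */
	if (link_drops_device_ref)
		toy_device_put(&dev);	/* link release drops the last ref */

	return node.refcount;
}
```

In this model simulate_overlay_removal(true) leaves the node at refcount
1, while simulate_overlay_removal(false) leaves it at 2, which is the
same "expected refcount 1 instead of 2" pattern as in the OF errors.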

A quick code inspection did not reveal anything that would help me
understand more.

Ideas?

Luca
--
Luca Ceresoli, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com