Re: [PATCH 1/1] sofware node: Only the managing device can unreference managed software node

From: Mike Isely

Date: Sat Feb 28 2026 - 11:34:28 EST


On Sat, 28 Feb 2026, Andy Shevchenko wrote:

> On Fri, Feb 27, 2026 at 11:55:51AM -0600, Mike Isely wrote:
> > On Thu, 26 Feb 2026, Andy Shevchenko wrote:

[snip]

>
> > I am unfamiliar with that driver or its surroundings, so I don't know if
> > it is a viable means to reproduce the problem. I can spend time
> > analyzing that code further if you wish. However it might be more
> > productive to try to succinctly, generically, spell out the steps that
> > lead to trouble:
> >
> > 1. Somebody creates a struct device instance.
> >
> > 2. Somebody calls device_create_managed_software_node() associating it
> > with that struct device instance. This causes two kobject references to
> > be counted against the swnode instance (one for the struct device and
> > one because the managed flag was set). refcount=2
> >
> > 3. Some time later, a child struct device is created and the properties
> > of the parent are shared with the child. This causes another kobject
> > reference to be counted against the swnode instance. refcount=3
> >
> > 4. Some time later, that child struct device instance goes away (perhaps
> > due to a hotplug removal). During the tear-down,
> > software_node_notify_remove() notices that the managed field of the
> > swnode instance is set, so TWO kobject references are removed instead of
> > the single reference created by the previous step. refcount=1
> >
> > 5. Repeat step 3. refcount=2
> >
> > 6. Repeat step 4. refcount=0
> >
> > 7. Now the kobject reference count inside the swnode instance has
> > dropped to zero and so the swnode instance is released. Notice that the
> > parent device is still holding a physical reference, now pointing into
> > freed memory. Chaos likely follows. If lucky, a NULL pointer kernel
> > oops is the result.
> >
> > I confirmed this exact sequence of behavior in the context of the sfp
> > kernel module with appropriate printk debug trace added. While that's
> > going to be hard for you to reproduce without sfp hardware, anything
> > else which performs those steps should produce similar results.
> >
> > The patch fixes this by causing step 4 to only do the second reference
> > decrement if the device being torn down is the same one that established
> > the managing relationship in the first place. Then everything stays
> > balanced.
> >
> > Notice that this has nothing to do with whether or not the properties
> > are const, nor does it have anything to do with FPGA management. I
> > apologize if I caused confusion leading to that.
>
> No problem, now with this very deep and comprehensive analysis everything
> is clear. Thanks for providing this all information!
>
> > It's purely the
> > combination of a managed swnode instance used for a device that will
> > create / remove child devices over time (due perhaps to hotplug
> > activity).
>
> > And because platform_device_register_full() will implicitly employ a
> > managed swnode instance any time it is passed properties in its pdevinfo
> > argument, then by implication anyone calling
> > platform_device_register_full() in that manner - which happens in
> > multiple other places in the kernel - is at risk for the same issue.
>
> Yeah, I will think about this more, but if somebody beats me up to it
> and confirms that this is the best what we can do, I will have no
> objections. I dunno, let's say in a couple of weeks?
>
>

No rush. We're already using the patch internally of course.

I can resend you the patch, fixing the process issues you cited, along
with the commit comment adjusted to include the sequence above which
finally made the issue crystal clear. I'll try to get that to you on
Monday or Tuesday.

-Mike
isely@xxxxxxxxx