Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()

From: Dmitry Torokhov

Date: Mon Feb 23 2026 - 19:38:21 EST


Hi Vladimir,

On Tue, Feb 24, 2026 at 01:15:00AM +0200, Vladimir Oltean wrote:
> Hi Dmitry,
>
> On Fri, Feb 20, 2026 at 05:01:50PM -0800, Dmitry Torokhov wrote:
> > On Thu, Feb 19, 2026 at 04:11:37PM -0800, Dmitry Torokhov wrote:
> > > On Thu, Feb 19, 2026 at 03:57:11PM -0800, Dmitry Torokhov wrote:
> > > > The implementation put_device()s located device and then uses
> > > > container_of() on the pointer. The device may disappear by that time,
> > > > resulting in UAF.
> > > >
> > > > Fix the problem by keeping the reference to the framer device,
> > > > avoiding getting an extra reference to it in framer_get(), and making
> > > > sure to drop the reference in error path when we fail to get the module.
> > >
> > > Hmm, I was too rash. There are bunch of other xlate functions that need
> > > to be updated to take the reference.
> >
> > So I am convinced that xlate functions need to bump up the reference to
> > phy devices they return. The question is how to deal with the ones that
> > do not. I can either convert them in the same patch (the changes are
> > quite mechanical) or we can do the whole song and dance, introduce a
> > flag, set it up in converted xlate functions, have the core respect it,
> > and then remove it from xlates and from the core when it is all done.
> >
> > Please let me know.
>
> You have hit on a weak spot of the Generic PHY framework and I would
> like to encourage you to follow through with patches for this finding
> (although it won't be exactly trivial, I think it's doable with some
> determination).

Thank you for the detailed response. I am not sure how much time I can
spend on phy rework as this change was an essentially a drive-by. I only
happen to look there because I want to remove
class_find_device_by_of_node() in favor of class_find_device_by_fwnode().

>
> On your direct question:
> It will impact downstream in subtle and unpleasant ways if you change
> the of_xlate() semantics w.r.t. reference counting, but the code will
> still compile just as before, i.e. when looking just at your downstream
> driver (what 99% of developers do) there will be no obvious way of
> knowing that something in the API has changed. I would avoid doing that.

Hmm, I was not aware that we needed to take particular measures for
benefit of downstream trees...

>
> But first the problem. Too little has been said about the problem, and
> too much about the solution. We can't find a good solution if we don't
> call out the problem properly first.
>
> The phy_get() function follows a "lookup-then-get" approach for PHY
> providers, rather than "atomic-lookup-and-get".
>
> Namely, the phy_provider->of_xlate() caller, which is _of_phy_get(),
> returns a struct phy * with its underlying device reference count
> nominally at 1. All callers of _of_phy_get() do later call get_device()
> to bump this refcount for each PHY consumer, but it is "too late" in the
> sense that a window has been opened where the PHY provider driver can
> unbind, and the reference count of &phy->dev can drop to 0 and it can be
> freed.

Yes.

>
> Immediately after being created by phy_create(), the &phy->dev has a
> refcount of 1, and only the action of its provider driver (calling
> phy_destroy() directly or through devres, on driver unbind) can drop
> that refcount to 0. As long as the driver doesn't do that, the &phy->dev
> refcount is not in danger.

Right.

>
> Then the PHY is exposed to the outside world using of_phy_provider_register(),
> where the fact that the provider driver can concurrently unregister
> starts being a problem.
>
> Therefore, I would say that the problem is that consumers, aka phy_get()
> callers, can get a phy with a &phy->dev refcount that is not already bumped
> by the time they are handed over that PHY.
>
> Mechanically, PHY lookup happens under &phy_provider_mutex, and I
> believe it would be sufficient to call get_device() under this lock, in
> order for consumers to get PHYs with their reference bumped.

<skip the explanation and a patch>

Yes, given that there is 1:1 mapping between device and provider (not
phy instance but its parent) it looks like we can drop the reference and
then bump it up again as long as we hold that mutex.

>
> The patch above leaves a few loose ends.
>
> The most obvious is of_phy_simple_xlate(), which has that put_device()
> to balance out class_find_device_by_of_node() - which bumps the device
> refcount to 2. It "looks" wrong but it is consistent with vendor
> implementations of of_xlate(), which also provide a phy->dev refcount of 1.
> And as explained, the refcount never _actually_ drops to 0 with the
> above patch.
>
> I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake.
> Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I
> would offer a new helper, devm_of_phy_simple_provider_register(), which
> would internally use a callback that doesn't drop the refcount (say
> __of_phy_simple_xlate() for lack of imagination). I would convert vendor
> PHY drivers one by one to use this (it would be valid at any time to use
> either the old or the new method).
>
> You'd notice that most of the of_phy_simple_xlate() occurrences go away,
> except for freescale/phy-fsl-lynx-28g.c which calls this function
> directly from its own lynx_28g_xlate(). It will still have to keep
> calling it, and for refcount equalization purposes that weird
> put_device() will have to continue to exist. Just leave a comment as to
> why that is.

I this case I would simply add a comment telling why the reference
should (and can) be dropped where it is being dropped in
of_phy_simple_xlate() and call it a day. There is not much benefit in
adding another helper.

>
>
> But there are more important loose ends still.
>
> I mentioned that "zombie" device. We've solved the memory safety issue,
> but it's possible for consumers to hold onto a phy whose provider has
> disappeared. The refcount of &phy->dev hasn't dropped to 0, so
> technically it's a valid object, but from PHY API perspective, it's
> still possible to call phy_init(), phy_power_on() and friends on this
> PHY, and the Generic PHY core will be happy to further call into the
> phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound,
> so it should really be left alone.
>
> If we fix the UAF but leave the zombie PHY problem, we've effectively
> done nothing but silence static analysis checkers, while the code path
> where the PHY provider unbinds effectively remains treated as poorly as
> before, just moving the crashes to a different place.
>
> I suspect what needs to be done here is to introduce a "bool dead" or
> similar, which is to be set from phy_destroy() and checked from every
> API call. The idea is that API functions on zombie PHYs should fail
> without calling into their driver. I **suppose** that
> try_module_get(phy->ops->owner) "tried" to avoid this situation, but
> it just protects against module removal, not against "echo device >
> /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable.

I think this is a problem common to many kernel subsystems, where there
are devices that are vital for other devices to function, but are not
parents of said devices. Think about regulators or clocks and such.
In many such cases even validating at API level is not sufficient,
because if you enable a clock you typically keep it running at least for
a while, if not for entire duration of device being bound to a driver.
If that clock goes away the device will break even though you are not
calling any clock APIs at that particular time.

We need some kind of revocation mechanism to signal consumers that
providers they rely upon are going away.

>
>
> Finally, I have identified one more loose end still.
>
> /**
> * of_phy_put() - release the PHY
> * @phy: the phy returned by of_phy_get()
> *
> * Releases a refcount the caller received from of_phy_get().
> */
> void of_phy_put(struct phy *phy)
> {
> if (!phy || IS_ERR(phy))
> return;
>
> mutex_lock(&phy->mutex);
> if (phy->ops->release)
> phy->ops->release(phy);
> mutex_unlock(&phy->mutex);
>
> module_put(phy->ops->owner);
> put_device(&phy->dev);
> }
> EXPORT_SYMBOL_GPL(of_phy_put);
>
> This function is called by PHY consumers. A PHY provider can have
> multiple consumers (example in the case of Ethernet: a QSGMII SerDes
> lane has 4 MAC ports multiplexed onto it).
>
> If a single consumer calls of_phy_put() to release its reference to the
> SerDes lane, the phy->ops->release() method makes absolutely no sense.
> There are 3 remaining consumers with handles to the lane! But we aren't
> even telling the PHY which consumer has disappeared! It has nothing
> useful to do with this information.
>
> Looking at actual phy_ops implementations, what they want to know is
> when _all_ consumers went away, not when individual consumers did.
> So the phy->ops->release() call needs to be put somewhere which is
> executed when all consumers disappear. If I were to guess, that would be
> the phy_release() class callback, but this is completely untested.

Shouldn't phy->ops->release() keep track of users and decide when to
destroy the resources? The rest seem fine as they simply drop references
(I assume each user of shared phy will bump up references as needed?)

>
> Sorry that this email is a bit long, it got a bit late writing it and I
> don't have a lot of energy left to trim it down.
> In summary I found 4 highly related problems:
> - use after free
> - misleading of_phy_simple_xlate() API pattern (not a functional issue)
> - zombie PHYs
> - premature release call

Thanks.

--
Dmitry