Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()

From: Vladimir Oltean

Date: Mon Feb 23 2026 - 18:19:51 EST

Hi Dmitry,

On Fri, Feb 20, 2026 at 05:01:50PM -0800, Dmitry Torokhov wrote:
> On Thu, Feb 19, 2026 at 04:11:37PM -0800, Dmitry Torokhov wrote:
> > On Thu, Feb 19, 2026 at 03:57:11PM -0800, Dmitry Torokhov wrote:
> > > The implementation put_device()s located device and then uses
> > > container_of() on the pointer. The device may disappear by that time,
> > > resulting in UAF.
> > >
> > > Fix the problem by keeping the reference to the framer device,
> > > avoiding getting an extra reference to it in framer_get(), and making
> > > sure to drop the reference in error path when we fail to get the module.
> >
> > Hmm, I was too rash. There are bunch of other xlate functions that need
> > to be updated to take the reference.
>
> So I am convinced that xlate functions need to bump up the reference to
> phy devices they return. The question is how to deal with the ones that
> do not. I can either convert them in the same patch (the changes are
> quite mechanical) or we can do the whole song and dance, introduce a
> flag, set it up in converted xlate functions, have the core respect it,
> and then remove it from xlates and from the core when it is all done.
>
> Please let me know.

You have hit on a weak spot of the Generic PHY framework and I would
like to encourage you to follow through with patches for this finding
(although it won't be exactly trivial, I think it's doable with some
determination).

On your direct question:
It will impact downstream in subtle and unpleasant ways if you change
the of_xlate() semantics w.r.t. reference counting, but the code will
still compile just as before, i.e. when looking just at your downstream
driver (what 99% of developers do) there will be no obvious way of
knowing that something in the API has changed. I would avoid doing that.

But first the problem. Too little has been said about the problem, and
too much about the solution. We can't find a good solution if we don't
call out the problem properly first.

The phy_get() function follows a "lookup-then-get" approach for PHY
providers, rather than "atomic-lookup-and-get".

Namely, the phy_provider->of_xlate() caller, which is _of_phy_get(),
returns a struct phy * with its underlying device reference count
nominally at 1. All callers of _of_phy_get() do later call get_device()
to bump this refcount for each PHY consumer, but it is "too late" in the
sense that a window has been opened where the PHY provider driver can
unbind, and the reference count of &phy->dev can drop to 0 and it can be
freed.

Immediately after being created by phy_create(), the &phy->dev has a
refcount of 1, and only the action of its provider driver (calling
phy_destroy() directly or through devres, on driver unbind) can drop
that refcount to 0. As long as the driver doesn't do that, the &phy->dev
refcount is not in danger.

Then the PHY is exposed to the outside world using of_phy_provider_register(),
where the fact that the provider driver can concurrently unregister
starts being a problem.

Therefore, I would say that the problem is that consumers, aka phy_get()
callers, can get a phy with a &phy->dev refcount that is not already bumped
by the time they are handed over that PHY.

Mechanically, PHY lookup happens under &phy_provider_mutex, and I
believe it would be sufficient to call get_device() under this lock, in
order for consumers to get PHYs with their reference bumped.

Why?

Let's consider what a PHY provider does when unbinding. It runs the PHY
registration and creation processes in reverse, i.e.
- it calls of_phy_provider_unregister() directly or through devres
- it calls phy_destroy() directly or through devres

Since of_phy_provider_unregister() also acquires &phy_provider_mutex,
this acts like a synchronization barrier. There are 2 cases:

Consumer Provider
phy_get()
-> mutex_lock(&phy_provider_mutex)
-> get_device(&phy->dev) -> 2
-> mutex_unlock(&phy_provider_mutex)
of_phy_provider_unregister()
-> mutex_lock(&phy_provider_mutex)
-> removes phy from list
-> mutex_unlock(&phy_provider_mutex)
phy_destroy()
-> device_unregister(&phy->dev)
-> device_del(dev);
-> put_device(dev); // -> 1

and the consumer remains with a "zombie" PHY device (more later)

Or

Consumer Provider
phy_get()
of_phy_provider_unregister()
-> mutex_lock(&phy_provider_mutex)
-> removes phy from list
-> mutex_unlock(&phy_provider_mutex)
-> mutex_lock(&phy_provider_mutex)
-> doesn't find the PHY device!
It was unpublished even if not freed
-> mutex_unlock(&phy_provider_mutex)
phy_destroy()
-> device_unregister(&phy->dev)
-> device_del(dev);
-> put_device(dev); // -> 1

and the consumer won't be allowed to even find the PHY device in this case.
So this is why get_device() under phy_provider_mutex actually helps.

Now, it is much less important whether you push the get_device() to the
actual of_xlate() PHY vendor implementation or not. For simplicity sake,
I suggest you don't do that, but instead keep it in the core, to
preserve driver API semantics (patch not even compile-tested):

-- >8 --
diff --git a/drivers/phy/phy-core.c b/drivers/phy/phy-core.c
index 21aaf2f76e53..c50e38f057a8 100644
--- a/drivers/phy/phy-core.c
+++ b/drivers/phy/phy-core.c
@@ -122,17 +122,19 @@ EXPORT_SYMBOL_GPL(phy_remove_lookup);
static struct phy *phy_find(struct device *dev, const char *con_id)
{
const char *dev_id = dev_name(dev);
- struct phy_lookup *p, *pl = NULL;
+ struct phy *phy = NULL;
+ struct phy_lookup *p;

mutex_lock(&phy_provider_mutex);
list_for_each_entry(p, &phys, node)
if (!strcmp(p->dev_id, dev_id) && !strcmp(p->con_id, con_id)) {
- pl = p;
+ phy = p->phy;
+ get_device(&phy->dev);
break;
}
mutex_unlock(&phy_provider_mutex);

- return pl ? pl->phy : ERR_PTR(-ENODEV);
+ return phy ? phy : ERR_PTR(-ENODEV);
}

static struct phy_provider *of_phy_provider_lookup(struct device_node *node)
@@ -649,6 +651,7 @@ static struct phy *_of_phy_get(struct device_node *np, int index)
}

phy = phy_provider->of_xlate(phy_provider->dev, &args);
+ get_device(&phy->dev);

out_put_module:
module_put(phy_provider->owner);
@@ -682,10 +685,10 @@ struct phy *of_phy_get(struct device_node *np, const char *con_id)
if (IS_ERR(phy))
return phy;

- if (!try_module_get(phy->ops->owner))
+ if (!try_module_get(phy->ops->owner)) {
+ put_device(&phy->dev);
return ERR_PTR(-EPROBE_DEFER);
-
- get_device(&phy->dev);
+ }

return phy;
}
@@ -803,10 +806,10 @@ struct phy *phy_get(struct device *dev, const char *string)
if (IS_ERR(phy))
return phy;

- if (!try_module_get(phy->ops->owner))
+ if (!try_module_get(phy->ops->owner)) {
+ put_device(&phy->dev);
return ERR_PTR(-EPROBE_DEFER);
-
- get_device(&phy->dev);
+ }

link = device_link_add(dev, &phy->dev, DL_FLAG_STATELESS);
if (!link)
@@ -969,11 +972,10 @@ struct phy *devm_of_phy_get_by_index(struct device *dev, struct device_node *np,

if (!try_module_get(phy->ops->owner)) {
devres_free(ptr);
+ put_device(&phy->dev);
return ERR_PTR(-EPROBE_DEFER);
}

- get_device(&phy->dev);
-
*ptr = phy;
devres_add(dev, ptr);

-- >8 --

The patch above leaves a few loose ends.

The most obvious is of_phy_simple_xlate(), which has that put_device()
to balance out class_find_device_by_of_node() - which bumps the device
refcount to 2. It "looks" wrong but it is consistent with vendor
implementations of of_xlate(), which also provide a phy->dev refcount of 1.
And as explained, the refcount never _actually_ drops to 0 with the
above patch.

I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake.
Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I
would offer a new helper, devm_of_phy_simple_provider_register(), which
would internally use a callback that doesn't drop the refcount (say
__of_phy_simple_xlate() for lack of imagination). I would convert vendor
PHY drivers one by one to use this (it would be valid at any time to use
either the old or the new method).

You'd notice that most of the of_phy_simple_xlate() occurrences go away,
except for freescale/phy-fsl-lynx-28g.c which calls this function
directly from its own lynx_28g_xlate(). It will still have to keep
calling it, and for refcount equalization purposes that weird
put_device() will have to continue to exist. Just leave a comment as to
why that is.

But there are more important loose ends still.

I mentioned that "zombie" device. We've solved the memory safety issue,
but it's possible for consumers to hold onto a phy whose provider has
disappeared. The refcount of &phy->dev hasn't dropped to 0, so
technically it's a valid object, but from PHY API perspective, it's
still possible to call phy_init(), phy_power_on() and friends on this
PHY, and the Generic PHY core will be happy to further call into the
phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound,
so it should really be left alone.

If we fix the UAF but leave the zombie PHY problem, we've effectively
done nothing but silence static analysis checkers, while the code path
where the PHY provider unbinds effectively remains treated as poorly as
before, just moving the crashes to a different place.

I suspect what needs to be done here is to introduce a "bool dead" or
similar, which is to be set from phy_destroy() and checked from every
API call. The idea is that API functions on zombie PHYs should fail
without calling into their driver. I **suppose** that
try_module_get(phy->ops->owner) "tried" to avoid this situation, but
it just protects against module removal, not against "echo device >
/sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable.

Finally, I have identified one more loose end still.

/**
* of_phy_put() - release the PHY
* @phy: the phy returned by of_phy_get()
*
* Releases a refcount the caller received from of_phy_get().
*/
void of_phy_put(struct phy *phy)
{
if (!phy || IS_ERR(phy))
return;

mutex_lock(&phy->mutex);
if (phy->ops->release)
phy->ops->release(phy);
mutex_unlock(&phy->mutex);

module_put(phy->ops->owner);
put_device(&phy->dev);
}
EXPORT_SYMBOL_GPL(of_phy_put);

This function is called by PHY consumers. A PHY provider can have
multiple consumers (example in the case of Ethernet: a QSGMII SerDes
lane has 4 MAC ports multiplexed onto it).

If a single consumer calls of_phy_put() to release its reference to the
SerDes lane, the phy->ops->release() method makes absolutely no sense.
There are 3 remaining consumers with handles to the lane! But we aren't
even telling the PHY which consumer has disappeared! It has nothing
useful to do with this information.

Looking at actual phy_ops implementations, what they want to know is
when _all_ consumers went away, not when individual consumers did.
So the phy->ops->release() call needs to be put somewhere which is
executed when all consumers disappear. If I were to guess, that would be
the phy_release() class callback, but this is completely untested.

Sorry that this email is a bit long, it got a bit late writing it and I
don't have a lot of energy left to trim it down.
In summary I found 4 highly related problems:
- use after free
- misleading of_phy_simple_xlate() API pattern (not a functional issue)
- zombie PHYs
- premature release call