Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()

From: Vladimir Oltean

Date: Wed Feb 25 2026 - 18:52:57 EST

On Tue, Feb 24, 2026 at 12:38:48PM -0800, Dmitry Torokhov wrote:
> > Obviously it benefits the maintainer/reviewer first and foremost. If you
> > don't make the API change unmissable, drivers expecting the old convention
> > will eventually creep their way into the mainline kernel. Why put avoidable
> > pressure on reviewers watching out for things like this for years to come,
> > when you can force downstream to notice and adapt (OR not make the
> > change in the first place). Note that downstream means "API consumers
> > invisible to you, the API changer", not only "hopelessly un-upstreamable
> > drivers".
>
> Well, this is whole "stable kernel API" argument. Rules of lifetime
> around objects may change and the kernel needs to adapt.

Actually it's not the "stable kernel API" argument.

I just said that if the of_xlate() callbacks currently need to do:

return phy;

and you propose a new scheme where they have to do:

get_device(&phy->dev);
return phy;

, then you are creating the possibility for:

- the API change to be missed by anyone who hasn't been _actively_
monitoring you, because the old code will still compile just as well,
and won't show any superficial runtime problem either
- the reviewers and maintainers to not notice that patches were
developed according to the old convention

and such patches may even get accepted, but be buggy.

I actually don't care whether the kernel API is stable or not. But
if you change it, make it obvious it has changed.

> Anyway, this is your subsystem so you get to decide.

Not my subsystem. Sorry if I gave you the wrong impression.

> > > > But there are more important loose ends still.
> > > >
> > > > I mentioned that "zombie" device. We've solved the memory safety issue,
> > > > but it's possible for consumers to hold onto a phy whose provider has
> > > > disappeared. The refcount of &phy->dev hasn't dropped to 0, so
> > > > technically it's a valid object, but from PHY API perspective, it's
> > > > still possible to call phy_init(), phy_power_on() and friends on this
> > > > PHY, and the Generic PHY core will be happy to further call into the
> > > > phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound,
> > > > so it should really be left alone.
> > > >
> > > > If we fix the UAF but leave the zombie PHY problem, we've effectively
> > > > done nothing but silence static analysis checkers, while the code path
> > > > where the PHY provider unbinds effectively remains treated as poorly as
> > > > before, just moving the crashes to a different place.
> > > >
> > > > I suspect what needs to be done here is to introduce a "bool dead" or
> > > > similar, which is to be set from phy_destroy() and checked from every
> > > > API call. The idea is that API functions on zombie PHYs should fail
> > > > without calling into their driver. I **suppose** that
> > > > try_module_get(phy->ops->owner) "tried" to avoid this situation, but
> > > > it just protects against module removal, not against "echo device >
> > > > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable.
> > >
> > > I think this is a problem common to many kernel subsystems, where there
> > > are devices that are vital for other devices to function, but are not
> > > parents of said devices. Think about regulators or clocks and such.
> > > In many such cases even validating at API level is not sufficient,
> > > because if you enable a clock you typically keep it running at least for
> > > a while, if not for entire duration of device being bound to a driver.
> > > If that clock goes away the device will break even though you are not
> > > calling any clock APIs at that particular time.
> > >
> > > We need some kind of revocation mechanism to signal consumers that
> > > providers they rely upon are going away.
> >
> > Yeah, this gave me pause.
> >
> > When you unbind a Generic PHY you can kind of expect that the data path
> > it affects to not work. But you'd sort of expect it to start working
> > again when you rebind the PHY driver.
> >
> > That will not be the case, though.
> >
> > The problem, simply put, is that the struct phy that the consumer has
> > will be different from the second struct phy that the provider registers
> > when it rebinds. Using the stale struct phy, we cannot reach the phy_ops
> > of the new provider.
> >
> > If we implement something similar to what Bartosz Golaszewski suggested
> > here:
> > https://lpc.events/event/17/contributions/1627/
>
> There is also "revocable" from Tzung-Bi Shih.

OK. Thanks for the heads up.

> > aka decouple the struct phy given to the consumer from the struct phy
> > provided by the provider, and perform a PHY lookup during each
> > phy_init() / phy_power_up() / etc etc operation;
> >
> > we are kinda able to:
> > - match the consumer phy to the provider phy and forward the call to
> > phy_ops, if the provider is present (or rebound)
> > - return -ENODEV instead of calling into phy_ops, if the provider is not
> > present
> >
> > But there's still one big gap.
> >
> > PHY consumers are driving the PHY through a state machine when they do
> > phy_init -> phy_power_up() -> phy_set_mode() -> [ phy_configure()...].
> > When the provider unbinds and then rebinds, that state is lost. But the
> > consumer has no idea. It knows it is in a state where it called
> > phy_power_up(), and next thing it knows, the power_count is 1 and it
> > must call phy_power_down().
> >
> > The Generic PHY core cannot actually replay the entire state in a
> > meaningful way so as to make the provider reprobing completely
> > transparent (think of phy_calibrate() calls).
> >
> > So I guess we're looking at what Bartosz refers to as a notification
> > side-channel that the PHY provider is going away.
> >
> > At least, for drivers that care in a meaningful way. For drivers that
> > don't care about such complexity, there seems to be a simpler answer:
> > device_link_add(DL_FLAG_AUTOREMOVE_CONSUMER).
>
> Does DL_FLAG_AUTOREMOVE_CONSUMER result in forceful unbinding of the
> consumer when provider goes away? And if it happens does it happen in
> the right order?

If "the right order" means "consumer before provider", yes.

> > I can also answer why device links with "autoremove consumer" are not a
> > universal answer. This is because the consumer itself may be a
> > multi-port network switch (single device). If you unbind the Generic PHY
> > driver (retimer, SerDes PHY) from port 3, you don't want ports 0, 1 and
> > 2 to also disappear from the kernel. You need the more complex PHY
> > provider disappearance notification which does something more localized
> > (calls phylink_stop(), I don't know).
>
> Aren't the ports represented as individual devices with their own state
> and lifetime?

Nope. A single platform_device/pci_device/spi_device/i2c_client/mdio_device/etc
represents an entire switch, and its driver will call register_netdevice()
multiple times, once per port. This is the switchdev model as per
Documentation/networking/switchdev.rst. Technically each netdev has its
own "class" device (for /sys/class/net/) but it's not a "bus" device,
doesn't have a bound driver, and their parent is the aforementioned
platform_device/etc.

> > I can say right off the bat that this is too complicated for me to even
> > think about in more detail than that, at the moment. I would be quite
> > happy if it's possible to unbind the PHY driver without the possibility
> > to rebind it, as a first step.
>
> Yes, this is something that is not phy specific and I think should be
> solved at driver core (or at least driver core should provide subsystems
> tools to solve this).

Some subsystems I'm slightly more familiar with do have mechanisms to
handle this.

Networking has (a pretty complicated) netdev_wait_allrefs_any(), which
waits for all references to a net_device to be dropped before fully
unregistering it (these are net_device :: dev_refcnt, different from
struct device references). Drivers are supposed to monitor the netdev
notifier call chain for the NETDEV_UNREGISTER event, and let go of the
reference.

Sadly I don't know enough about the Generic PHY subsystem to be making
generalizations with any claim of correctness, but I suspect the
majority of consumers are in a 1:1 relationship with the PHY provider,
and would be happy with the cost/benefit ratio given by the
device_link(DL_FLAG_AUTOREMOVE_CONSUMER) behaviour. Only a select few
would care to continue their lifetime in a PHY-less environment
(normal operation on other ports / degraded on the PHY-less port,
reestablish connection to provider when it comes back, etc).

I'm not sure that there exists a space for "revocable" in Generic PHY
between the above two options. Only if the device link idea doesn't work
out, for some reason I'm not seeing. I'll have to put this thought in
the background and let it simmer for a while.