Re: [PATCH] phy: core: fix potential UAF in of_phy_simple_xlate()

From: Dmitry Torokhov

Date: Tue Feb 24 2026 - 17:07:01 EST


On Tue, Feb 24, 2026 at 05:47:30PM +0200, Vladimir Oltean wrote:
> On Mon, Feb 23, 2026 at 04:37:39PM -0800, Dmitry Torokhov wrote:
> > > You have hit on a weak spot of the Generic PHY framework and I would
> > > like to encourage you to follow through with patches for this finding
> > > (although it won't be exactly trivial, I think it's doable with some
> > > determination).
> >
> > Thank you for the detailed response. I am not sure how much time I can
> > spend on phy rework as this change was an essentially a drive-by. I only
> > happen to look there because I want to remove
> > class_find_device_by_of_node() in favor of class_find_device_by_fwnode().
>
> OK, I believe you can still do that without any dependency.

Right, this is independent, I just happened to look at that function.

>
> > > On your direct question:
> > > It will impact downstream in subtle and unpleasant ways if you change
> > > the of_xlate() semantics w.r.t. reference counting, but the code will
> > > still compile just as before, i.e. when looking just at your downstream
> > > driver (what 99% of developers do) there will be no obvious way of
> > > knowing that something in the API has changed. I would avoid doing that.
> >
> > Hmm, I was not aware that we needed to take particular measures for
> > benefit of downstream trees...
>
> I think you're entirely missing the point.
>
> Obviously it benefits the maintainer/reviewer first and foremost. If you
> don't make the API change unmissable, drivers expecting the old convention
> will eventually creep their way into the mainline kernel. Why put avoidable
> pressure on reviewers watching out for things like this for years to come,
> when you can force downstream to notice and adapt (OR not make the
> change in the first place). Note that downstream means "API consumers
> invisible to you, the API changer", not only "hopelessly un-upstreamable
> drivers".

Well, this is whole "stable kernel API" argument. Rules of lifetime
around objects may change and the kernel needs to adapt. Anyway, this is
your subsystem so you get to decide.

>
> > > The patch above leaves a few loose ends.
> > >
> > > The most obvious is of_phy_simple_xlate(), which has that put_device()
> > > to balance out class_find_device_by_of_node() - which bumps the device
> > > refcount to 2. It "looks" wrong but it is consistent with vendor
> > > implementations of of_xlate(), which also provide a phy->dev refcount of 1.
> > > And as explained, the refcount never _actually_ drops to 0 with the
> > > above patch.
> > >
> > > I would actually address _only_ of_phy_simple_xlate(), for cosmetics sake.
> > > Instead of devm_of_phy_provider_register(..., of_phy_simple_xlate), I
> > > would offer a new helper, devm_of_phy_simple_provider_register(), which
> > > would internally use a callback that doesn't drop the refcount (say
> > > __of_phy_simple_xlate() for lack of imagination). I would convert vendor
> > > PHY drivers one by one to use this (it would be valid at any time to use
> > > either the old or the new method).
> > >
> > > You'd notice that most of the of_phy_simple_xlate() occurrences go away,
> > > except for freescale/phy-fsl-lynx-28g.c which calls this function
> > > directly from its own lynx_28g_xlate(). It will still have to keep
> > > calling it, and for refcount equalization purposes that weird
> > > put_device() will have to continue to exist. Just leave a comment as to
> > > why that is.
> >
> > I this case I would simply add a comment telling why the reference
> > should (and can) be dropped where it is being dropped in
> > of_phy_simple_xlate() and call it a day. There is not much benefit in
> > adding another helper.
>
> OK.
>
> > > But there are more important loose ends still.
> > >
> > > I mentioned that "zombie" device. We've solved the memory safety issue,
> > > but it's possible for consumers to hold onto a phy whose provider has
> > > disappeared. The refcount of &phy->dev hasn't dropped to 0, so
> > > technically it's a valid object, but from PHY API perspective, it's
> > > still possible to call phy_init(), phy_power_on() and friends on this
> > > PHY, and the Generic PHY core will be happy to further call into the
> > > phy->ops->init(), phy->ops->power_on() etc. But the driver has unbound,
> > > so it should really be left alone.
> > >
> > > If we fix the UAF but leave the zombie PHY problem, we've effectively
> > > done nothing but silence static analysis checkers, while the code path
> > > where the PHY provider unbinds effectively remains treated as poorly as
> > > before, just moving the crashes to a different place.
> > >
> > > I suspect what needs to be done here is to introduce a "bool dead" or
> > > similar, which is to be set from phy_destroy() and checked from every
> > > API call. The idea is that API functions on zombie PHYs should fail
> > > without calling into their driver. I **suppose** that
> > > try_module_get(phy->ops->owner) "tried" to avoid this situation, but
> > > it just protects against module removal, not against "echo device >
> > > /sys/bus/.../unbind". So it's absolutely incomplete and easily bypassable.
> >
> > I think this is a problem common to many kernel subsystems, where there
> > are devices that are vital for other devices to function, but are not
> > parents of said devices. Think about regulators or clocks and such.
> > In many such cases even validating at API level is not sufficient,
> > because if you enable a clock you typically keep it running at least for
> > a while, if not for entire duration of device being bound to a driver.
> > If that clock goes away the device will break even though you are not
> > calling any clock APIs at that particular time.
> >
> > We need some kind of revocation mechanism to signal consumers that
> > providers they rely upon are going away.
>
> Yeah, this gave me pause.
>
> When you unbind a Generic PHY you can kind of expect that the data path
> it affects to not work. But you'd sort of expect it to start working
> again when you rebind the PHY driver.
>
> That will not be the case, though.
>
> The problem, simply put, is that the struct phy that the consumer has
> will be different from the second struct phy that the provider registers
> when it rebinds. Using the stale struct phy, we cannot reach the phy_ops
> of the new provider.
>
> If we implement something similar to what Bartosz Golaszewski suggested
> here:
> https://lpc.events/event/17/contributions/1627/

There is also "revocable" from Tzung-Bi Shih.

>
> aka decouple the struct phy given to the consumer from the struct phy
> provided by the provider, and perform a PHY lookup during each
> phy_init() / phy_power_up() / etc etc operation;
>
> we are kinda able to:
> - match the consumer phy to the provider phy and forward the call to
> phy_ops, if the provider is present (or rebound)
> - return -ENODEV instead of calling into phy_ops, if the provider is not
> present
>
> But there's still one big gap.
>
> PHY consumers are driving the PHY through a state machine when they do
> phy_init -> phy_power_up() -> phy_set_mode() -> [ phy_configure()...].
> When the provider unbinds and then rebinds, that state is lost. But the
> consumer has no idea. It knows it is in a state where it called
> phy_power_up(), and next thing it knows, the power_count is 1 and it
> must call phy_power_down().
>
> The Generic PHY core cannot actually replay the entire state in a
> meaningful way so as to make the provider reprobing completely
> transparent (think of phy_calibrate() calls).
>
> So I guess we're looking at what Bartosz refers to as a notification
> side-channel that the PHY provider is going away.
>
> At least, for drivers that care in a meaningful way. For drivers that
> don't care about such complexity, there seems to be a simpler answer:
> device_link_add(DL_FLAG_AUTOREMOVE_CONSUMER).

Does DL_FLAG_AUTOREMOVE_CONSUMER result in forceful unbinding of the
consumer when provider goes away? And if it happens does it happen in
the right order?

>
> I can also answer why device links with "autoremove consumer" are not a
> universal answer. This is because the consumer itself may be a
> multi-port network switch (single device). If you unbind the Generic PHY
> driver (retimer, SerDes PHY) from port 3, you don't want ports 0, 1 and
> 2 to also disappear from the kernel. You need the more complex PHY
> provider disappearance notification which does something more localized
> (calls phylink_stop(), I don't know).

Aren't the ports represented as individual devices with their own state
and lifetime?

>
>
> I can say right off the bat that this is too complicated for me to even
> think about in more detail than that, at the moment. I would be quite
> happy if it's possible to unbind the PHY driver without the possibility
> to rebind it, as a first step.

Yes, this is something that is not phy specific and I think should be
solved at driver core (or at least driver core should provide subsystems
tools to solve this).

Thanks.

--
Dmitry