Re: [RFC PATCH] driver core: Don't link the device to the bus until we're ready to probe

From: Doug Anderson

Date: Sat Mar 21 2026 - 03:35:57 EST


Hi,

On Fri, Mar 20, 2026 at 10:41 PM Greg Kroah-Hartman
<gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Mar 20, 2026 at 08:06:58PM -0700, Douglas Anderson wrote:
> > The moment we link a "struct device" into the list of devices for the
> > bus, it's possible probe can happen. This is because another thread
> > can load the driver at any time and that can cause the device to
> > probe. This has been seen in practice with a stack crawl that looks
> > like this [1]:
> >
> > really_probe()
> > __driver_probe_device()
> > driver_probe_device()
> > __driver_attach()
> > bus_for_each_dev()
> > driver_attach()
> > bus_add_driver()
> > driver_register()
> > __platform_driver_register()
> > init_module() [some module]
> > do_one_initcall()
> > do_init_module()
> > load_module()
> > __arm64_sys_finit_module()
> > invoke_syscall()
>
> Are you sure this isn't just a platform bus issue? A bus should NOT be
> allowing a driver to be added at the same time a device is being added
> for that bus, ideally there should be a bus-specific lock somewhere for
> this.

Sure, if the right fix for this is somewhere in the platform bus code
then I'd be happy with a patch there to fix it. ...but from my quick
glance (admittedly, it's Friday night and I'm tired), it seems like
the problem is just with driver_register() being called at the same
time as device_add().

Certainly adding some sort of locking could be a solution (happy for
someone to tell me where to place them), but we'd have to make sure we
aren't regressing performance for the normal case...


> When a device is added to the bus, yes, a probe can happen, and is
> expected to happen, for that device, so this feels odd.
>
> that being said, your patch does seem sane, and I don't see anything
> obviously wrong with it. But it feels odd that this is just now showing
> up for something that has been this way for a few decades...

I suspect it's a latent bug that was triggered by a new Android
feature. It's showing up on phones that have
"ro.boot.load_modules_parallel" set. I think you can get to the
relevant source code at:

https://cs.android.com/android/platform/superproject/main/+/main:system/core/libmodprobe/libmodprobe.cpp?q=LoadModulesParallel

I suspect the bug is never triggered with more normal module loading
schemes. Indeed, one phone that has nearly the same set of drivers but
has parallel module loading turned off has no reports of this
problem...

I'd also note that the only actual symptom we're seeing is with
fw_devlink misbehaving (because dev->fwnode->dev wasn't set early
enough). fw_devlink is a "new" (ish) feature, is officially optional,
and isn't used on all hardware.

...so I think there are enough reasons to explain why we're seeing
this now even though it seems like it's been wrong for a few decades.

-Doug