Re: [PATCH v2] driver core: Don't let a device probe until it's ready
From: Doug Anderson
Date: Wed Apr 01 2026 - 17:13:29 EST
Hi,
On Wed, Apr 1, 2026 at 2:06 PM Danilo Krummrich <dakr@xxxxxxxxxx> wrote:
>
> On Tue Mar 31, 2026 at 5:26 PM CEST, Doug Anderson wrote:
> > Hi,
> >
> > On Tue, Mar 31, 2026 at 7:42 AM Danilo Krummrich <dakr@xxxxxxxxxx> wrote:
> >>
> >> > @@ -848,6 +848,18 @@ static int __driver_probe_device(const struct device_driver *drv, struct device
> >> > if (dev->driver)
> >> > return -EBUSY;
> >> >
> >> > + /*
> >> > + * In device_add(), the "struct device" gets linked into the subsystem's
> >> > + * list of devices and broadcast to userspace (via uevent) before we're
> >> > + * quite ready to probe. Those open pathways to driver probe before
> >> > + * we've finished enough of device_add() to reliably support probe.
> >> > + * Detect this and tell other pathways to try again later. device_add()
> >> > + * itself will also try to probe immediately after setting
> >> > + * "ready_to_probe".
> >> > + */
> >> > + if (!dev->ready_to_probe)
> >> > + return dev_err_probe(dev, -EPROBE_DEFER, "Device not ready_to_probe");
> >>
> >> Are we sure this dev->ready_to_probe dance does not introduce a new subtle bug
> >> considering that ready_to_probe is within a bitfield of struct device?
> >>
> >> I.e. are we sure there are no potential concurrent modifications of other fields
> >> in this bitfield that are not protected with the device lock?
> >>
> >> For instance, in __driver_attach() we set dev->can_match if
> >> driver_match_device() returns -EPROBE_DEFER without the device lock held.
> >
> > Bleh. Thank you for catching this. I naively assumed the device lock
> > protected the bitfield, but I didn't verify that.
> >
> >
> >> This is exactly the case you want to protect against, i.e. device_add() racing
> >> with __driver_attach().
> >>
> >> So, there is a chance that the dev->ready_to_probe change gets interleaved with
> >> a dev->can_match change.
> >>
> >> I think all this goes away if we stop using bitfields for synchronization; we
> >> should convert some of those to flags that we can modify with set_bit() and
> >> friends instead.
> >
> > That sounds reasonable to me. Do you want me to send a v3 where I
> > create a new "unsigned long flags" in struct device and introduce this
> > as the first flag? If there are additional bitfields you want me to
> > convert, I can send them as additional patches in the series as long
> > as it's not too big of a change...
>
> I think the one with the biggest potential to cause real issues is can_match, as
> it is modified without the device lock held from __driver_attach(), which can be
> called at any time concurrently.
>
> (I think there are others as well, but they are more on the theoretical side of
> things. For instance, dma_skip_sync is modified by dma_set_mask(), which
> strictly speaking does not require the device lock to be held. In practice,
> that's probably never an issue since dma_set_mask() is typically called from bus
> callbacks usually, but it's not strictly a requirement.)
>
> More in general, from a robustness point of view, everything that is set once at
> device creation time is fine to be a bitfield; bits that are used for
> synchronization or are modified concurrently, I'd rather use bitops.
OK, thanks! I've almost finished with patches to fully move all of
them to bitops. This means we simply don't need to think about them.
Also: even if they're not truly needed as bitops, it's nice (and saves
space) not to have some of some flags using bitfields and some bitops
unless there's a good reason.
I'll make sure that "can_match" is second in the list of patches. If
you hate the idea of converting the other ones we can just apply the
earlier patches in the series and drop the rest. ;-)
-Doug