Re: [PATCH v2] cxl/core/port: defer endpoint probes when ACPI likely hasn't finished
From: Gregory Price
Date: Fri Oct 04 2024 - 22:05:31 EST
On Fri, Oct 04, 2024 at 05:08:03PM -0700, Dan Williams wrote:
> Gregory Price wrote:
> > In cxl_acpi_probe, we add dports and uports to host bridges iteratively:
> > - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_dport);
> > - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_uport);
> >
> > Simultaneously, as ports are probed, memdev endpoints can also be
> > probed. This creates a race condition, where an endpoint can perceive
> > its path to the root being broken in devm_cxl_enumerate_ports.
> >
> > The memdev/endpoint probe will see a heirarchy that may look like:
> > mem1
> > parent => 0000:c1:00.0
> > parent => 0000:c0:01.1
> > parent->parent => NULL
> >
> > This results in find_cxl_port() returning NULL (since the port hasn't
> > been associated with the host bridge yet), and add_port_attach_ep
> > fails because the grandparent's grandparent is NULL.
> >
> > When the latter condition is detected, the comments suggest:
> > /*
> > * The iteration reached the topology root without finding the
> > * CXL-root 'cxl_port' on a previous iteration, fail for now to
> > * be re-probed after platform driver attaches.
> > */
> >
> > This case results in an -ENXIO; however, a re-probe never occurs. Change
> > this return condition to -EPROBE_DEFER to explicitly cause a reprobe.
>
> Ok, thanks for the additional debug. Like we chatted on the CXL Discord
> I think this is potentially pointing to a bug in bus_rescan_devices()
> where it checks dev->driver without holding the lock.
>
> Can you give this fix a try to see if it also resolves the issue?
> Effectively, cxl_bus_rescan() is always needed in case the cxl_acpi
> driver loads waaaay after deferred probing has given up, and if this
> works then EPROBE_DEFER can remain limited to cases where it is
> absolutely known that no other device_attach() kick is coming to save
> the day.
>
Funny enough, not only did it not work, but now i get neither endpoint lol
$ ls /sys/bus/cxl/devices/
decoder0.0 decoder1.0 decoder2.0 decoder3.1 mem0 port1 port3 root0
decoder0.1 decoder1.1 decoder3.0 decoder4.0 mem1 port2 port4
w/ reprobe patch
# ls /sys/bus/cxl/devices
decoder0.0 decoder1.0 decoder2.0 decoder3.1 decoder5.0 decoder6.0 endpoint5 mem0 port1 port3 root0
decoder0.1 decoder1.1 decoder3.0 decoder4.0 decoder5.1 decoder6.1 endpoint6 mem1 port2 port4
> -- 8< --
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 1d5007e3795a..6c0cd94888a3 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -2088,11 +2088,18 @@ static void cxl_bus_remove(struct device *dev)
>
> static struct workqueue_struct *cxl_bus_wq;
>
> -static void cxl_bus_rescan_queue(struct work_struct *w)
> +static int attach_device(struct device *dev, void *data)
> {
> - int rc = bus_rescan_devices(&cxl_bus_type);
> + int rc = device_attach(dev);
> +
> + dev_vdbg(dev, "rescan: %s\n", rc ? "attach" : "detached");
>
> - pr_debug("CXL bus rescan result: %d\n", rc);
> + return 0;
> +}
> +
> +static void cxl_bus_rescan_queue(struct work_struct *w)
> +{
> + bus_for_each_dev(&cxl_bus_type, NULL, NULL, attach_device);
> }
>
> void cxl_bus_rescan(void)