Re: [PATCH] usb: hub: Fix error loop seen after hub communication errors

From: Guenter Roeck
Date: Fri Mar 10 2017 - 18:38:11 EST


Hi Alan,

On 03/10/2017 01:14 PM, Alan Stern wrote:
On Wed, 8 Mar 2017, Guenter Roeck wrote:

While stress testing a usb controller using a bind/unbind looop, the
following error loop was observed.

usb 7-1.2: new low-speed USB device number 3 using xhci-hcd
usb 7-1.2: hub failed to enable device, error -108
usb 7-1-port2: cannot disable (err = -22)
usb 7-1-port2: couldn't allocate usb_device
usb 7-1-port2: cannot disable (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: activate --> -22
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: hub_ext_port_status failed (err = -22)
hub 7-1:1.0: activate --> -22
...

This continues forever. After adding tracebacks into the code,
the call sequence leading to this is found to be as follows.

[<ffffffc0007fc8e0>] hub_activate+0x368/0x7b8
[<ffffffc0007fceb4>] hub_resume+0x2c/0x3c
[<ffffffc00080b3b8>] usb_resume_interface.isra.6+0x128/0x158
[<ffffffc00080b5d0>] usb_suspend_both+0x1e8/0x288
[<ffffffc00080c9c4>] usb_runtime_suspend+0x3c/0x98
[<ffffffc0007820a0>] __rpm_callback+0x48/0x7c
[<ffffffc00078217c>] rpm_callback+0xa8/0xd4
[<ffffffc000786234>] rpm_suspend+0x84/0x758
[<ffffffc000786ca4>] rpm_idle+0x2c8/0x498
[<ffffffc000786ed4>] __pm_runtime_idle+0x60/0xac
[<ffffffc00080eba8>] usb_autopm_put_interface+0x6c/0x7c
[<ffffffc000803798>] hub_event+0x10ac/0x12ac
[<ffffffc000249bb8>] process_one_work+0x390/0x6b8
[<ffffffc00024abcc>] worker_thread+0x480/0x610
[<ffffffc000251a80>] kthread+0x164/0x178
[<ffffffc0002045d0>] ret_from_fork+0x10/0x40

kick_hub_wq() is called from hub_activate() even after failures to
communicate with the hub. This results in an endless sequence of
hub event -> hub activate -> wq trigger -> hub event -> ...
To break the loop, only trigger the hub event queue if communication
with the hub is successful.

Signed-off-by: Guenter Roeck <linux@xxxxxxxxxxxx>
---
drivers/usb/core/hub.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 5286bf67869a..f22ab428b310 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -1197,7 +1197,8 @@ static void hub_activate(struct usb_hub *hub, enum hub_activation_type type)
&hub->leds, LED_CYCLE_PERIOD);

/* Scan all ports that need attention */
- kick_hub_wq(hub);
+ if (!status)
+ kick_hub_wq(hub);

if (type == HUB_INIT2 || type == HUB_INIT3) {
/* Allow autosuspend if it was suppressed */

I'm not at all sure this is the best solution. status == 0 means that
an URB was successfully _submitted_; it does not mean that future
communication with the hub will be successful.

The problem arises here because the system is unable to runtime-suspend
the hub (you didn't say why, but probably because of the same
communication problem). When a suspend fails, drivers are informed

Yes, exactly.

that the device is back at full power, and of course the hub driver
tries to determine the hub's current state at that time. When this
fails, there's no reason to keep the hub at full power any longer, so
the system tries to do another runtime suspend. This suspend fails
just like the first one, and there's your loop.


If we don't cut the loop in hub_activate(), where else would be
a good place to cut it ?

This same kind of thing could happen with any driver. In your case,
the problem could be solved by making sure that when the hub fails to
suspend, it is still at least minimally functional. We don't do this
now; when usb_port_suspend fails there is no checking to see whether
the device is still working.

(It would be interesting to know exactly at which point
usb_port_suspend fails during your test. And why didn't your system
mark the hub as disconnected when the host controller was unbound?)

It does, but that happens a bit later. If I disable console logging
(for example by using dev_err_ratelimited() instead of dev_err() for the
observed errors), the hub is marked as disabled after a few hundred
or a few thousand loops. Only that code never gets to run runs if console
logging is enabled, and to restrict logging didn't seem to be a good
solution for the problem.

The test script I am running is:

while true
do
for i in /sys/bus/platform/drivers/rockchip-dwc3/usb*; do
basename $i > $(dirname $i)/unbind
basename $i > $(dirname $i)/bind
done
sleep $(awk "BEGIN {printf \"%.2f\", $((RANDOM % 150)) / 100 + 0.5}")
done

where rockchip-dwc3 is a slightly more complex version of dwc3-of-simple.c,
with pretty much the same functionality, only that it ties into extcon
to be able to handle cable states. The hub disconnect is a result of a call
to of_platform_depopulate(), similar to the same call in dwc3-of-simple.c.

The sleep in the above script is essential; it results in all kinds of
race conditions. The one I am trying to fix here is just one of them. The
system under test has two Type-C ports; one is connected to an Ethernet
interface and the other to an Apple Type-C adapter (it probably doesn't
matter much what is connected, as long as there is a hub).

An even worse problem is what to do when the device is working okay but
runtime suspend still fails (for example, if remote wakeup cannot be
enabled). How do we prevent the system from going into a similar loop
then?


Good question. I am open to suggestions.

Thanks,
Guenter