Re: Re: Re: [syzbot] INFO: rcu detected stall in tx

From: Alan Stern
Date: Thu May 06 2021 - 09:49:07 EST


On Wed, May 05, 2021 at 10:22:24PM +0000, Guido Kiener wrote:
> > -----Original Message-----
> > From: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx>
> > Sent: Tuesday, May 4, 2021 5:14 PM
> > To: Kiener Guido 14DS1
> > Subject: Re: Re: [syzbot] INFO: rcu detected stall in tx
> >
> > On Mon, May 03, 2021 at 09:56:05PM +0000, Guido Kiener wrote:
> > > Hi all,
> > >
> > > Dave and I discussed the "self-detected stall on CPU" caused by the
> > > usbtmc driver.
> > >
> > > What happened?
> > > The callback handler usbtmc_interrupt(struct urb *urb) for the INT pipe
> > > receives an erroneous urb with status -EPROTO (-71).
> > > See
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/class/usbtmc.c?h=v5.12#n2340
> > > -EPROTO does not abort/shutdown the pipe and the urb is resubmitted to
> > > receive the next packet. However the callback handler usbtmc_interrupt
> > > is called again with the same erroneous status -EPROTO and this seems
> > > to result in an endless loop.
> > > According to
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/driver-api/usb/error-codes.rst?h=v5.12#n177
> > > the error -EPROTO indicates a hardware problem or a bad cable.
> > >
> > > Most usb drivers do not react in a specific way to these hardware
> > > problems and resubmit the urb. We assume these drivers will run into
> > > the same endless loop. Some other driver samples are:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/usb/class/cdc-acm.c?h=v5.12#n379
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hid/usbhid/usbmouse.c?h=v5.12#n65
> > >
> > > Possible solutions:
> > > Hardware defects or bad cables seem to be a common problem for most
> > > usb drivers and I assume we do not want to fix this problem in all
> > > class specific drivers, but in lower level host drivers, e.g.:
> > > 1. Use a counter and close the pipe after some detected errors.
> > > 2. Delay the resubmission of the urb to avoid high cpu usage.
> > > 3. Do nothing, since it is just a rare problem.
> > >
> > > We've never seen this problem in our products and we do not dare to
> > > change anything.
> >
> > Drivers are not consistent in the way they handle these errors, as you have seen. A
> > few try to take active measures, such as retries with increasing timeouts. Many
> > drivers just ignore them, which is not a very good idea.
> >
> > The general feeling among kernel USB developers is that a -EPROTO, -EILSEQ, or
> > -ETIME error should be regarded as fatal, much the same as an unplug event. The
> > driver should avoid resubmitting URBs and just wait to be unbound from the device.
>
> Thanks for your assessment. I agree with the general feeling. I counted about a hundred
> specific usb drivers, so wouldn't it be better to fix the problem in some of the host drivers (e.g. urb.c)?
> We could return an error when calling usb_submit_urb() on an erroneous pipe.
> I cannot estimate the side effects and we would need to check all drivers again to see how they
> deal with the error situation. Maybe there are some special drivers that need specialized error
> handling. In this case these drivers could reset the (new?) error flag to allow calling
> usb_submit_urb() again without error. This could work, couldn't it?

That is feasible, although it would be an awkward approach. As you
said, the side effects aren't clear. But it might work.
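
To make the idea concrete, here is a rough userspace model of it, not
kernel code. Every name in it (ep_model, model_complete, model_submit,
model_clear_error) is hypothetical and exists nowhere in the USB core,
and the error code returned by the refused submission is an arbitrary
choice. It only shows the intended behavior: a completion with -EPROTO,
-EILSEQ, or -ETIME latches a flag, submissions fail while the flag is
set, and a driver that needs special handling can clear it.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one endpoint; not a real kernel structure. */
struct ep_model {
	bool error_latched;	/* the proposed "(new?) error flag" */
};

/* The three statuses regarded as fatal in this thread. */
static bool status_is_fatal(int status)
{
	return status == -EPROTO || status == -EILSEQ || status == -ETIME;
}

/* Model of URB completion: remember that a fatal status was seen. */
static void model_complete(struct ep_model *ep, int status)
{
	if (status_is_fatal(status))
		ep->error_latched = true;
}

/*
 * Model of submission: refuse while the flag is set, which breaks the
 * resubmit loop. Returning -EPROTO here is an assumption; the proposal
 * does not say which error the submission should report.
 */
static int model_submit(const struct ep_model *ep)
{
	return ep->error_latched ? -EPROTO : 0;
}

/* Opt-out for a driver that wants to retry despite the error. */
static void model_clear_error(struct ep_model *ep)
{
	ep->error_latched = false;
}
```

Whether the flag should live per endpoint or per device, and when it
should be cleared automatically (for example on reset), is left open in
this sketch.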

> > If you would like to audit drivers and fix them up to behave this way, that would be
> > great.
>
> Currently not. I cannot pull the USB cable in home office :-), but I will keep an eye on it.
> When I'm more involved in the next USB driver issue, then I will test bad cables and
> maybe get more ideas about how we could test and fix this rare error.

Will you be able to test patches?

Alan Stern