RE: [PATCH v7 01/10] usb: gadget: udc: Add timer support for usb requests

From: Anurag Kumar Vulisha
Date: Tue Dec 04 2018 - 14:07:38 EST


Hi Alan,

>-----Original Message-----
>From: Alan Stern [mailto:stern@xxxxxxxxxxxxxxxxxxx]
>Sent: Tuesday, December 04, 2018 10:17 PM
>To: Anurag Kumar Vulisha <anuragku@xxxxxxxxxx>
>Cc: Felipe Balbi <balbi@xxxxxxxxxx>; Greg Kroah-Hartman
><gregkh@xxxxxxxxxxxxxxxxxxx>; Shuah Khan <shuah@xxxxxxxxxx>; Johan Hovold
><johan@xxxxxxxxxx>; Jaejoong Kim <climbbb.kim@xxxxxxxxx>; Benjamin
>Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>; Roger Quadros <rogerq@xxxxxx>; Manu
>Gautam <mgautam@xxxxxxxxxxxxxx>; martin.petersen@xxxxxxxxxx; Bart Van
>Assche <bvanassche@xxxxxxx>; Mike Christie <mchristi@xxxxxxxxxx>; Matthew
>Wilcox <willy@xxxxxxxxxxxxx>; Colin Ian King <colin.king@xxxxxxxxxxxxx>; linux-
>usb@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; v.anuragkumar@xxxxxxxxx;
>Thinh Nguyen <thinhn@xxxxxxxxxxxx>; Tejas Joglekar
><tejas.joglekar@xxxxxxxxxxxx>; Ajay Yugalkishore Pandey <APANDEY@xxxxxxxxxx>
>Subject: RE: [PATCH v7 01/10] usb: gadget: udc: Add timer support for usb requests
>
>On Tue, 4 Dec 2018, Anurag Kumar Vulisha wrote:
>
>> >> Yes the host can cancel the transfer. This issue originated from
>> >> the endpoints using
>> >bulk
>> >> streaming protocol and may not occur with normal endpoints. AFAIK
>> >> bulk streaming
>> >is
>> >> gadget driven, where the gadget is allowed to select which stream
>> >> id transfer the
>> >host
>> >> should work on . Since the host doesn't have control on when the
>> >> transfer would be selected by gadget, it may wait for longer timeouts before
>cancelling the transfer.
>> >
>> >You're missing the point. Although the device selects which stream
>> >ID gets transferred, the _host_ decides whether a stream transfer
>> >should occur in the first place. No matter how many ERDY packets the
>> >device controller tries to send, no transfer will occur until the
>> >host wants to do it.
>> >
>> >In this sense, stream transfers (like all other USB interactions
>> >except wakeup requests) are host-driven.
>> >
>>
>> I agree with you that without host willing to transfer, the transfer
>> wouldn't have started and also agree the host always has the
>> capability of detecting the hang. But we are not sure on what host
>> platform and operating system the gadget would be tested and also not sure
>whether the host software is capable of handling the deadlock.
>> Even if the host has a timer , it would be very long and waiting for
>> the timer to get timedout would reduce the performance. In this patch
>> we are giving the user the flexibility to opt the timeout value, so
>> that the deadlock can be avoided without effecting the performance. So
>> I think adding the timer in gadget is more easier solution than handling in host. I
>have seen this workaround working for both linux & windows.
>
>Is there any way for the device controller driver to detect the problem without relying
>on a timeout?
>
>In fact, is this problem a known bug in the existing device controller hardware? Have
>the controller manufacturers published a suggested scheme for working around it?
>

Yes, this problem is mentioned in dwc3 controller data book and the workaround
mentioned is the same that we are doing , to implement timeout and if no valid
stream event is found , after timeout issue end transfer followed by start transfer.

>> >> >> Since the gadget
>> >> >> controller driver is aware that the controller is stuck , makes
>> >> >> it responsible to recover the controller from hang condition by
>> >> >> restarting the transfer (which triggers the controller FSM to issue ERDY to
>host).
>> >> >
>> >> >Isn't there a cleaner way to recover than by cancelling the
>> >> >request and resubmitting it?
>> >> >
>> >>
>> >> dequeuing the request issues the stop transfer command to the
>> >> controller, which cleans all the hardware resource allocated for
>> >> that endpoint. This also resets the hardware FSMs for that endpoint
>> >> . So, when re-queuing of the transfer happens the controller starts
>> >> allocating hardware resources again, thus avoiding the
>> >probability
>> >> of entering into the issue. I am not sure of other controllers, but
>> >> for dwc3, issuing the stop transfer is the only way to handle this issue.
>> >
>> >Again you're missing the point. Can't the controller driver issue
>> >the Stop Transfer command but still consider the request to be in
>> >progress (i.e., don't dequeue the request) so that the gadget
>> >driver's completion callback isn't invoked and the request does not
>> >need to be explicitly resubmitted?
>> >
>>
>> Yes , what you are saying is correct. My initial patches were
>> following the same approach that you have suggested. We switched to
>> the dequeue approach because there can be different gadgets which may
>> enter into this issue and it would be better to follow a generic
>> approach rather than having every controller driver implementing their own
>workaround.
>> Please find the conservation in this link
>> https://patchwork.kernel.org/patch/10640145/
>
>At this point, it seems that the generic approach will be messier than having every
>controller driver implement its own fix. At least, that's how it appears to me.

With the dequeue approach there is an advantage(not related to this issue) that the
class driver can have control of the timed out transfer whether to requeue it or consider
it as an error (based on implementation). This approach gives more flexibility to the class
drivers. This may be out of context of this issue but wanted to point out that we may lose
this advantage on switching to older implementation.

>
>> >> @Felipe: Can you please provide your suggestion on this.
>> >
>> >> >How can the gadget driver know what timeout to use? The host is
>> >> >allowed to be as slow as it wants; the gadget driver doesn't have
>> >> >any way to tell when the host wants to start the transfer.
>> >>
>> >> Yes , I agree with you that the timeout may vary from usage to
>> >> usage. This timeout should be decided by the class driver which
>> >> queues the request. As discussed above this issue was observed in
>> >> streaming protocol , which is very much faster than
>> >normal
>> >> BOT modes and it works on super speed .
>> >
>> >Although USB mass storage is currently the only user of the stream
>> >protocol, that may not be true in the future. You should think in
>> >more general terms. A timeout which is appropriate for mass storage
>> >may not be appropriate for some other application.
>> >
>>
>> You are correct , this timeout may not be ideal. Since I was not able
>> to reproduce this issue on non-stream capable transfers , at present
>> the only user of usb_ep_queue_timeout() API is the streaming gadget.
>> As we are not aware of the optimal timeout value, instead of
>> hardcoding the timeout value we can add module_param() for it. So that the user
>can configure timeout based on their use case. Please let me know your suggestion
>on this.
>
>Ideally it would not be necessary to rely on a timeout at all.
>
>Also, maintainers dislike module parameters. It would be better not to add one.

Okay. I would be happy if any alternative for this issue is present but unfortunately
I am not able to figure out any alternative other than timers. If not module_params()
we can add an configfs entry in stream gadget to update the timeout. Please provide
your opinion on this approach.

Thanks,
Anurag Kumar Vulisha