Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting

From: Heikki Krogerus
Date: Fri Jun 11 2021 - 09:14:17 EST


On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>
> Hi,
>
> Wesley Cheng <wcheng@xxxxxxxxxxxxxx> writes:
> >>>>>>> to be honest, I don't think these should go in (apart from the build
> >>>>>>> failure) because they're likely to break instantiations of the core with
> >>>>>>> differing FIFO sizes. Some instantiations even have some endpoints with
> >>>>>>> dedicated functionality that requires the default FIFO size configured
> >>>>>>> during coreConsultant instantiation. I know of at least OMAP5 and
> >>>>>>> some Intel implementations which have dedicated endpoints for
> >>>>>>> processor tracing.
> >>>>>>>
> >>>>>>> With OMAP5, these endpoints are configured at the top of the available
> >>>>>>> endpoints, which means that if a gadget driver gets loaded and takes
> >>>>>>> over most of the FIFO space because of this resizing, processor tracing
> >>>>>>> will have a hard time running. That being said, processor tracing isn't
> >>>>>>> supported in upstream at this moment.
> >>>>>>>
> >>>>>
> >>>>> I agree that the application of this logic may differ between vendors,
> >>>>> which is why I wanted to keep this controllable by the DT property, so
> >>>>> that those which do not support this use case can leave it disabled. The
> >>>>> logic is there to ensure that for a given USB configuration, for each EP
> >>>>> it would have at least 1 TX FIFO. For USB configurations which don't
> >>>>> utilize all available IN EPs, it would allow re-allocation of internal
> >>>>> memory to EPs which will actually be in use.
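
To make the above concrete, here is a minimal sketch with made-up sizes
(not the actual dwc3 code; the real logic reads the FIFO RAM depth from
the controller's hardware parameter registers and only touches IN
endpoints):

/*
 * Sketch only, not the actual dwc3 implementation.  Split a
 * hypothetical pool of TX FIFO RAM across the IN endpoints that the
 * current configuration actually enables, guaranteeing each one at
 * least one max packet worth of FIFO.
 */
#define TX_FIFO_RAM_BYTES       (16 * 1024)     /* made-up total TX FIFO RAM */
#define SS_MAX_PACKET           1024            /* SuperSpeed bulk max packet */

static unsigned int tx_fifo_share(unsigned int in_eps_in_use)
{
        unsigned int share;

        if (!in_eps_in_use)
                return 0;

        /* Every enabled IN endpoint gets an equal share of the RAM... */
        share = TX_FIFO_RAM_BYTES / in_eps_in_use;

        /* ...but never less than one max packet. */
        if (share < SS_MAX_PACKET)
                share = SS_MAX_PACKET;

        return share;
}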
> >>>>
> >>>> The feature ends up being all-or-nothing, then :-) It sounds like we can
> >>>> be a little nicer in this regard.
> >>>>
> >>>
> >>> Don't get me wrong, I think once those features become available
> >>> upstream, we can improve the logic. From what I remember when looking
> >>
> >> sure, I support that. But I want to make sure the first cut isn't likely
> >> to break things left and right :)
> >>
> >> Hence, let's at least get more testing.
> >>
> >
> > Sure, I'd hope that the other users of DWC3 will also see some pretty
> > big improvements on the TX path with this.
>
> fingers crossed
>
> >>> at Andy Shevchenko's Github, the Intel tracer downstream changes were
> >>> just to remove physical EP1 and 2 from the DWC3 endpoint list. If that
> >>
> >> right, that's the reason why we introduced the endpoint feature
> >> flags. The end goal was that the UDC would be able to have custom
> >> feature flags paired with ->validate_endpoint() or whatever before
> >> allowing it to be enabled. Then the UDC driver could tell UDC core to
> >> skip that endpoint on that particular platform without interfering with
> >> everything else.
> >>
> >> Of course, we still need to figure out a way to abstract the different
> >> dwc3 instantiations.
> >>
> >>> was the change which ended up upstream for the Intel tracer, then we
> >>> could improve the logic to avoid re-sizing those particular EPs.
> >>
> >> The problem then, just as I mentioned in the previous paragraph, will be
> >> coming up with a solution that's elegant and works for all different
> >> instantiations of dwc3 (or musb, cdns3, etc).
> >>
> >
> > Well, at least for the TX FIFO resizing logic, we'd only be needing to
> > focus on the DWC3 implementation.
> >
> > You bring up another good topic that I'll eventually need to take a
> > look at, which is a nice way we can handle vendor-specific
> > endpoints and how they can co-exist with other "normal" endpoints. We
> > have a few special HW eps as well, which we try to maintain separately
> > in our DWC3 vendor driver, but it isn't the most convenient, or the
> > prettiest, method :).
>
> Awesome, as mentioned, the endpoint feature flags were added exactly to
> allow for these vendor-specific features :-)
>
> I'm more than happy to help testing now that I finally got our SM8150
> Surface Duo device tree accepted by Bjorn ;-)
>
> >>> However, I'm not sure how the changes would look in the end, so I
> >>> would like to wait later down the line to include that :).
> >>
> >> Fair enough, I agree. Can we get some more testing of $subject, though?
> >> Did you test $subject with upstream too? Which gadget drivers did you
> >> use? How did you test?
> >>
> >
> > The results that I included in the cover page were gathered with the pure
> > upstream kernel on our device. Below was using the ConfigFS gadget w/ a
> > mass storage only composition.
> >
> > Test Parameters:
> > - Platform: Qualcomm SM8150
> > - bMaxBurst = 6
> > - USB req size = 256kB
> > - Num of USB reqs = 16
>
> do you mind testing with the regular request size (16KiB) and 250
> requests? I think we can even do 15 bursts in that case.
>
> > - USB Speed = Super-Speed
> > - Function Driver: Mass Storage (w/ ramdisk)
> > - Test Application: CrystalDiskMark
> >
> > Results:
> >
> > TXFIFO Depth = 3 max packets
> >
> > Test Case | Data Size | AVG tput (in MB/s)
> > -------------------------------------------
> > Sequential|1 GB x |
> > Read |9 loops | 193.60
> > | | 195.86
> > | | 184.77
> > | | 193.60
> > -------------------------------------------
> >
> > TXFIFO Depth = 6 max packets
> >
> > Test Case | Data Size | AVG tput (in MB/s)
> > -------------------------------------------
> > Sequential|1 GB x |
> > Read |9 loops | 287.35
> > | | 304.94
> > | | 289.64
> > | | 293.61
>
> I remember getting close to 400MiB/sec with Intel platforms without
> resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
> memory could be failing.
>
> Then again, I never ran with CrystalDiskMark, I was using my own tool
> (it's somewhere in github. If you care, I can look up the URL).
>
> > We also have internal numbers which have shown similar improvements as
> > well. Those are over networking/tethering interfaces, so testing IPERF
> > loopback over TCP/UDP.
>
> loopback iperf? That would skip the wire, no?
>
> >>> size of 2 and TX threshold of 1, this would really not be beneficial to
> >>> us, because we can only change the TX threshold to 2 at max, and at
> >>> least in my observations, once we have to go out to system memory to
> >>> fetch the next data packet, that latency takes enough time for the
> >>> controller to end the current burst.
> >>
> >> What I noticed with g_mass_storage is that we can amortize the cost of
> >> fetching data from memory, with a deeper request queue. Whenever I
> >> test(ed) g_mass_storage, I was doing so with 250 requests. And that was
> >> enough to give me very good performance. Never had to poke at TX FIFO
> >> resizing. Did you try something like this too?
> >>
> >> I feel that allocating more requests is a far simpler and more generic
> >> method than changing FIFO sizes :)
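
For reference, pre-filling an IN endpoint with a deep request queue in a
gadget function looks roughly like the sketch below. The helper names
are hypothetical and error handling is trimmed; 250 and 16 KiB are
simply the numbers being discussed here, nothing the API mandates:

#include <linux/slab.h>
#include <linux/usb/gadget.h>

#define NUM_REQS        250             /* queue depth being discussed */
#define REQ_LEN         (16 * 1024)     /* regular 16 KiB request size */

/* Hypothetical completion handler: just requeue the finished request. */
static void my_in_complete(struct usb_ep *ep, struct usb_request *req)
{
        if (req->status == 0)
                usb_ep_queue(ep, req, GFP_ATOMIC);
}

/* Prime the IN endpoint with NUM_REQS outstanding requests. */
static int my_prime_in_ep(struct usb_ep *ep)
{
        int i;

        for (i = 0; i < NUM_REQS; i++) {
                struct usb_request *req;

                req = usb_ep_alloc_request(ep, GFP_KERNEL);
                if (!req)
                        return -ENOMEM;

                req->buf = kmalloc(REQ_LEN, GFP_KERNEL);
                if (!req->buf) {
                        usb_ep_free_request(ep, req);
                        return -ENOMEM;
                }

                req->length = REQ_LEN;
                req->complete = my_in_complete;

                usb_ep_queue(ep, req, GFP_KERNEL);
        }

        return 0;
}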
> >>
> >
> > I wish I had a USB bus trace handy to show you, which would make it very
> > clear how the USB bus is currently utilized with TXFIFO size 2 vs 6. So
> > increasing the number of USB requests will help if there was a
> > bottleneck at the SW level where the application/function driver
> > utilizing the DWC3 was submitting data much faster than the HW was
> > processing them.
> >
> > So yes, this method of increasing the # of USB reqs will definitely help
> > with situations such as HSUSB or in SSUSB when EP bursting isn't used.
> > The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
> > bursting.
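
For context, the burst capability an IN endpoint advertises lives in its
SuperSpeed endpoint companion descriptor; in a gadget function it is
filled in roughly as below. The field is defined as packets-per-burst
minus one, so 5 means bursts of up to 6 packets (the variable name here
is hypothetical):

#include <linux/usb/ch9.h>

static struct usb_ss_ep_comp_descriptor my_ss_bulk_comp_desc = {
        .bLength                = USB_DT_SS_EP_COMP_SIZE,
        .bDescriptorType        = USB_DT_SS_ENDPOINT_COMP,
        .bMaxBurst              = 5,    /* bursts of up to 6 packets */
        /* .bmAttributes = 0 for bulk without streams */
};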
>
> Hmm, that's not what I remember. Perhaps the TRB cache size plays a role
> here too. I have clear memories of testing this very scenario of
> bursting (using g_mass_storage at the time) because I was curious about
> it. Back then, my tests showed no difference in behavior.
>
> It would be nice if Heikki could test Intel parts with and without your
> changes on g_mass_storage with 250 requests.

Andy, you have a system at hand that has the DWC3 block enabled,
right? Can you help out here?

thanks,


> > Now with endpoint bursting, if the function notifies the host that
> > bursting is supported, when the host sends the ACK for the Data Packet,
> > it should have a NumP value equal to the bMaxBurst reported in the EP
>
> Yes and no. Looking back at the history, we used to configure NUMP based
> on bMaxBurst, but it was changed later in commit
> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
> problem reported by John Youn.
>
> And now we've come full circle. Because even if I believe more requests
> are enough for bursting, NUMP is limited by the RxFIFO size. This ends
> up supporting your claim that we need RxFIFO resizing if we want to
> squeeze more throughput out of the controller.
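
If memory serves, after that commit NUMP is sized from the RxFIFO depth
rather than from bMaxBurst, roughly along the lines of the sketch below
(the overhead constant and the cap are from memory, so treat them as
approximate rather than the literal dwc3 code):

static unsigned int nump_from_rxfifo(unsigned int rxfifo_bytes)
{
        unsigned int nump;

        /*
         * Reserve a little FIFO space for protocol overhead, then count
         * how many 1024-byte SuperSpeed packets still fit.
         */
        nump = rxfifo_bytes > 64 ? (rxfifo_bytes - 64) / 1024 : 0;

        /* DCFG.NUMP is a small field; the driver caps the value at 16. */
        if (nump > 16)
                nump = 16;

        return nump ? nump : 1;
}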
>
> However, note that this is about RxFIFO size, not TxFIFO size. In fact,
> looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP
> (emphasis is mine):
>
> "Number of Packets (NumP). This field is used to indicate the
> number of Data Packet buffers that the **receiver** can
> accept. The value in this field shall be less than or equal to
> the maximum burst size supported by the endpoint as determined
> by the value in the bMaxBurst field in the Endpoint Companion
> Descriptor (refer to Section 9.6.7)."
>
> So, NumP is for the receiver, not the transmitter. Could you clarify
> what you mean here?
>
> /me keeps reading
>
> Hmm, table 8-15 tries to clarify:
>
> "Number of Packets (NumP).
>
> For an OUT endpoint, refer to Table 8-13 for the description of
> this field.
>
> For an IN endpoint this field is set by the endpoint to the
> number of packets it can transmit when the host resumes
> transactions to it. This field shall not have a value greater
> than the maximum burst size supported by the endpoint as
> indicated by the value in the bMaxBurst field in the Endpoint
> Companion Descriptor. Note that the value reported in this field
> may be treated by the host as informative only."
>
> However, if I remember correctly (please verify dwc3 databook), NUMP in
> DCFG was only for receive buffers. Thinh, John, how does dwc3 compute
> NumP for TX/IN endpoints? Is that computed as a function of DCFG.NUMP or
> TxFIFO size?
>
> > desc. If we have a TXFIFO size of 2, then normally what I have seen is
> > that after 2 data packets, the device issues an NRDY. So then we'd need
> > to send an ERDY once data is available within the FIFO, and the same
> > sequence happens until the USB request is complete. With this constant
> > NRDY/ERDY handshake going on, you actually see that the bus is
> > underutilized. When we increase an EP's FIFO size, you'll see constant
> > bursts for a request, until the request is done, or if the host runs out
> > of RXFIFO. (ie no interruption [on the USB protocol level] during USB
> > request data transfer)
>
> Unfortunately I don't have access to a USB sniffer anymore :-(
>
> >>>>>> Good points.
> >>>>>>
> >>>>>> Wesley, what kind of testing have you done on this on different devices?
> >>>>>>
> >>>>>
> >>>>> As mentioned above, these changes are currently present on end user
> >>>>> devices for the past few years, so it's been through a lot of testing :).
> >>>>
> >>>> all with the same gadget driver. Also, who uses USB on android devices
> >>>> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
> >>>> :-)
> >>>>
> >>>> I guess only developers are using USB during development to flash dev
> >>>> images heh.
> >>>>
> >>>
> >>> I used to be a customer-facing engineer, so honestly I did see some
> >>> really interesting and crazy designs. Again, we do have non-Android
> >>> products that use the same code, and it has been working in there for a
> >>> few years as well. The TXFIFO sizing really has helped with multimedia
> >>> use cases, which use isoc endpoints, since especially on lower-end CPU
> >>> chips, where latencies across the system are much larger, a missed
> >>> ISOC interval leads to a pop in your ear.
> >>
> >> This is good background information. Thanks for bringing this
> >> up. Admittedly, we still have ISOC issues with dwc3. I'm interested in
> >> knowing if a deeper request queue would also help here.
> >>
> >> Remember dwc3 can accommodate 255 requests + link for each endpoint. If
> >> our gadget driver uses a low number of requests, we're never really
> >> using the TRB ring to our benefit.
> >>
> >
> > We're actually using both a deeper USB request queue + TX fifo resizing. :).
>
> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX Burst
> behavior.

--
heikki