On 6/12/2021 2:27 PM, Ferry Toth wrote:
HiHi Ferry,
Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
On Fri, Jun 11, 2021 at 4:14 PM Heikki KrogerusI just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches
<heikki.krogerus@xxxxxxxxxxxxxxx> wrote:
On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
Hi,Andy, you have a system at hand that has the DWC3 block enabled,
Wesley Cheng <wcheng@xxxxxxxxxxxxxx> writes:
fingers crossedSure, I'd hope that the other users of DWC3 will also see some prettysure, I support that. But I want to make sure the first cut isn'tDon't get me wrong, I think once those features become availableThe feature ends up being all-or-nothing, then :-) It soundsI agree that the application of this logic may differ betweento be honest, I don't think these should go in (apart from
the build
failure) because it's likely to break instantiations of the
core with
differing FIFO sizes. Some instantiations even have some
endpoints with
dedicated functionality that requires the default FIFO size
configured
during coreConsultant instantiation. I know of at OMAP5 and
some Intel
implementations which have dedicated endpoints for processor
tracing.
With OMAP5, these endpoints are configured at the top of the
available
endpoints, which means that if a gadget driver gets loaded
and takes
over most of the FIFO space because of this resizing,
processor tracing
will have a hard time running. That being said, processor
tracing isn't
supported in upstream at this moment.
vendors,
hence why I wanted to keep this controllable by the DT
property, so that
for those which do not support this use case can leave it
disabled. The
logic is there to ensure that for a given USB configuration,
for each EP
it would have at least 1 TX FIFO. For USB configurations which
don't
utilize all available IN EPs, it would allow re-allocation of
internal
memory to EPs which will actually be in use.
like we can
be a little nicer in this regard.
upstream, we can improve the logic. From what I remember when
looking
likely
to break things left and right :)
Hence, let's at least get more testing.
big improvements on the TX path with this.
Awesome, as mentioned, the endpoint feature flags were added exactly toWell, at least for the TX FIFO resizing logic, we'd only be needing toat Andy Shevchenko's Github, the Intel tracer downstream changesright, that's the reason why we introduced the endpoint feature
were
just to remove physical EP1 and 2 from the DWC3 endpoint list.
If that
flags. The end goal was that the UDC would be able to have custom
feature flags paired with ->validate_endpoint() or whatever before
allowing it to be enabled. Then the UDC driver could tell UDC core to
skip that endpoint on that particular platform without
interefering with
everything else.
Of course, we still need to figure out a way to abstract the
different
dwc3 instantiations.
was the change which ended up upstream for the Intel tracer then weThe problem then, just as I mentioned in the previous paragraph,
could improve the logic to avoid re-sizing those particular EPs.
will be
coming up with a solution that's elegant and works for all different
instantiations of dwc3 (or musb, cdns3, etc).
focus on the DWC3 implementation.
You bring up another good topic that I'll eventually needing to be
taking a look at, which is a nice way we can handle vendor specific
endpoints and how they can co-exist with other "normal" endpoints. We
have a few special HW eps as well, which we try to maintain separately
in our DWC3 vendor driver, but it isn't the most convenient, or most
pretty method :).
allow for these vendor-specific features :-)
I'm more than happy to help testing now that I finally got our SM8150
Surface Duo device tree accepted by Bjorn ;-)
do you mind testing with the regular request size (16KiB) and 250The results that I included in the cover page was tested with the pureHowever, I'm not sure how the changes would look like in the end,Fair enough, I agree. Can we get some more testing of $subject,
so I
would like to wait later down the line to include that :).
though?
Did you test $subject with upstream too? Which gadget drivers did you
use? How did you test
upstream kernel on our device. Below was using the ConfigFS gadget
w/ a
mass storage only composition.
Test Parameters:
- Platform: Qualcomm SM8150
- bMaxBurst = 6
- USB req size = 256kB
- Num of USB reqs = 16
requests? I think we can even do 15 bursts in that case.
- USB Speed = Super-SpeedI remember getting close to 400MiB/sec with Intel platforms without
- Function Driver: Mass Storage (w/ ramdisk)
- Test Application: CrystalDiskMark
Results:
TXFIFO Depth = 3 max packets
Test Case | Data Size | AVG tput (in MB/s)
-------------------------------------------
Sequential|1 GB x |
Read |9 loops | 193.60
| | 195.86
| | 184.77
| | 193.60
-------------------------------------------
TXFIFO Depth = 6 max packets
Test Case | Data Size | AVG tput (in MB/s)
-------------------------------------------
Sequential|1 GB x |
Read |9 loops | 287.35
| | 304.94
| | 289.64
| | 293.61
resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
memory could be failing.
Then again, I never ran with CrystalDiskMark, I was using my own tool
(it's somewhere in github. If you care, I can look up the URL).
We also have internal numbers which have shown similar improvements asloopback iperf? That would skip the wire, no?
well. Those are over networking/tethering interfaces, so testing
IPERF
loopback over TCP/UDP.
Hmm, that's not what I remember. Perhaps the TRB cache size plays aI wish I had a USB bus trace handy to show you, which would make itsize of 2 and TX threshold of 1, this would really be notWhat I noticed with g_mass_storage is that we can amortize the
beneficial to
us, because we can only change the TX threshold to 2 at max, and at
least in my observations, once we have to go out to system memory to
fetch the next data packet, that latency takes enough time for the
controller to end the current burst.
cost of
fetching data from memory, with a deeper request queue. Whenever I
test(ed) g_mass_storage, I was doing so with 250 requests. And
that was
enough to give me very good performance. Never had to poke at TX FIFO
resizing. Did you try something like this too?
I feel that allocating more requests is a far simpler and more
generic
method that changing FIFO sizes :)
very
clear how the USB bus is currently utilized with TXFIFO size 2 vs
6. So
by increasing the number of USB requests, that will help if there
was a
bottleneck at the SW level where the application/function driver
utilizing the DWC3 was submitting data much faster than the HW was
processing them.
So yes, this method of increasing the # of USB reqs will definitely
help
with situations such as HSUSB or in SSUSB when EP bursting isn't used.
The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
bursting.
role
here too. I have clear memories of testing this very scenario of
bursting (using g_mass_storage at the time) because I was curious about
it. Back then, my tests showed no difference in behavior.
It could be nice if Heikki could test Intel parts with and without your
changes on g_mass_storage with 250 requests.
right? Can you help out here?
more test cases (I have only one or two) and maybe can help. But I'll
keep this in mind.
apply. Switching between host/gadget works, no connections dropping, no
errors in dmesg.
In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have
composite device created from configfs with gser / eem / mass_storage /
uac2.
Tested with iperf3 performance in host (93.6Mbits/sec) and gadget
(207Mbits/sec) mode. Compared to v5.10.41 without patches host
(93.4Mbits/sec) and gadget (198Mbits/sec).
Gadget seems to be a little faster with the patches, but that might also
be caused by something else, on v5.10.41 I see the bitrate bouncing
between 207 and 199.
I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec. With
v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.
With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41
34.7MB/s. This might be limited by Edison's internal eMMC speed (when
booting U-Boot reads the kernel with 21.4 MiB/s).
Thanks for the testing. Just to double check, did you also enable the
property, which enabled the TXFIFO resize feature on the platform? For
example, for the QCOM SM8150 platform, we're adding the following to our
device tree node:
tx-fifo-resize
If not, then your results at least confirms that w/o the property
present, the changes won't break anything :). Thanks again for the
initial testing!
Thanks
Wesley Cheng
--Now with endpoint bursting, if the function notifies the host thatYes and no. Looking back at the history, we used to configure NUMP
bursting is supported, when the host sends the ACK for the Data
Packet,
it should have a NumP value equal to the bMaxBurst reported in the EP
based
on bMaxBurst, but it was changed later in commit
4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
problem reported by John Youn.
And now we've come full circle. Because even if I believe more requests
are enough for bursting, NUMP is limited by the RxFIFO size. This ends
up supporting your claim that we need RxFIFO resizing if we want to
squeeze more throughput out of the controller.
However, note that this is about RxFIFO size, not TxFIFO size. In fact,
looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP
(emphasis is mine):
"Number of Packets (NumP). This field is used to indicate the
number of Data Packet buffers that the **receiver** can
accept. The value in this field shall be less than or equal to
the maximum burst size supported by the endpoint as determined
by the value in the bMaxBurst field in the Endpoint Companion
Descriptor (refer to Section 9.6.7)."
So, NumP is for the receiver, not the transmitter. Could you clarify
what you mean here?
/me keeps reading
Hmm, table 8-15 tries to clarify:
"Number of Packets (NumP).
For an OUT endpoint, refer to Table 8-13 for the description of
this field.
For an IN endpoint this field is set by the endpoint to the
number of packets it can transmit when the host resumes
transactions to it. This field shall not have a value greater
than the maximum burst size supported by the endpoint as
indicated by the value in the bMaxBurst field in the Endpoint
Companion Descriptor. Note that the value reported in this field
may be treated by the host as informative only."
However, if I remember correctly (please verify dwc3 databook), NUMP in
DCFG was only for receive buffers. Thin, John, how does dwc3 compute
NumP for TX/IN endpoints? Is that computed as a function of
DCFG.NUMP or
TxFIFO size?
desc. If we have a TXFIFO size of 2, then normally what I haveUnfortunately I don't have access to a USB sniffer anymore :-(
seen is
that after 2 data packets, the device issues a NRDY. So then we'd
need
to send an ERDY once data is available within the FIFO, and the same
sequence happens until the USB request is complete. With this
constant
NRDY/ERDY handshake going on, you actually see that the bus is under
utilized. When we increase an EP's FIFO size, then you'll see
constant
bursts for a request, until the request is done, or if the host
runs out
of RXFIFO. (ie no interruption [on the USB protocol level] during USB
request data transfer)
okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX BurstWe're actually using both a deeper USB request queue + TX fifoThis is good background information. Thanks for bringing thisI used to be a customer facing engineer, so honestly I did see someall with the same gadget driver. Also, who uses USB on androidGood points.As mentioned above, these changes are currently present on end
Wesley, what kind of testing have you done on this on
different devices?
user
devices for the past few years, so its been through a lot of
testing :).
devices
these days? Most of the data transfer goes via WiFi or
Bluetooth, anyway
:-)
I guess only developers are using USB during development to
flash dev
images heh.
really interesting and crazy designs. Again, we do have non-Android
products that use the same code, and it has been working in there
for a
few years as well. The TXFIFO sizing really has helped with
multimedia
use cases, which use isoc endpoints, since esp. in those lower
end CPU
chips where latencies across the system are much larger, and a
missed
ISOC interval leads to a pop in your ear.
up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
knowing if a deeper request queue would also help here.
Remember dwc3 can accomodate 255 requests + link for each
endpoint. If
our gadget driver uses a low number of requests, we're never really
using the TRB ring in our benefit.
resizing. :).
behavior.
heikki