Re: [REGRESSION] asix: Lots of asix_rx_fixup() errors and slow transmissions
From: Dean Jenkins
Date: Wed May 11 2016 - 18:00:26 EST
Hi John,
I have purchased a "uGreen" USB Ethernet Adaptor which was reported as
showing the issue:
lsusb shows:
ID 0b95:772b ASIX Electronics Corp. AX88772B
dmesg shows:
[119591.413298] usb 2-1: new high-speed USB device number 12 using ci_hdrc
[119591.576970] usb 2-1: New USB device found, idVendor=0b95, idProduct=772b
[119591.576994] usb 2-1: New USB device strings: Mfr=1, Product=2,
SerialNumber=3
[119591.577010] usb 2-1: Product: AX88772C
[119591.577025] usb 2-1: Manufacturer: ASIX Elec. Corp.
Strangely the product string says "AX88772C" and lsusb shows "AX88772B"
I used our ARM (32-bit 2 core) board running our highly customised 3.14
kernel and ran a ping test that slowly increments the ping payload size
so forcing the Ethernet frames to slowly extend in length and eventually
forcing IPv4 fragmentation to occur due to the MTU limit of 1500. In my
test the ICMP ping payload lengths ranged from 1 to 5000.
During the test run I saw (only 3 errors):
[27455.113010] asix 2-1:1.0 eth0: asix_rx_fixup() Data Header
synchronisation was lost, remaining 23
[27455.113037] asix 2-1:1.0 eth0: asix_rx_fixup() Bad Header Length
0x77767574, offset 4
[27456.113269] asix 2-1:1.0 eth0: asix_rx_fixup() Data Header
synchronisation was lost, remaining 27
[27456.113329] asix 2-1:1.0 eth0: asix_rx_fixup() Bad Header Length
0x77767574, offset 4
[27457.113271] asix 2-1:1.0 eth0: asix_rx_fixup() Data Header
synchronisation was lost, remaining 30
[27457.113328] asix 2-1:1.0 eth0: asix_rx_fixup() Bad Header Length
0x77767574, offset 4
This meets my expectation of "sync lost" followed immediately by "Bad
Header Length". A close look at the timestamps shows gaps of around 20us
to 50us which suggests the code is processing the same URB eg. "sync
lost" and "Bad Header Length" are written from the same instance of
asix_rx_fixup_internal().
My example suggests that the previous URB went missing so data was lost
causing a discontinuity in the data stream. This was the intended
purpose of the commit to prevent bad Ethernet frames being sent up the
IP stack when an URB went missing. A bad Ethernet frame would otherwise
be created by having the start of an Ethernet frame appended with data
from the current URB causing a corrupted Ethernet frame to be generated
and sent up the IP stack.
Also the failure seems to be independent of the ping payload length but
longer test periods of specific payload lengths would be needed to allow
the 32 bit header word to move around relative to the start of the URB
buffer.
In my example, the "Bad Header Length 0x77767574" is reading the ping
payload data of 0x74, 0x75, 0x76, 077 which is located at the start of
the URB buffer. The remaining values are low at 23 to 30 which suggests
the end of the Ethernet frame was in the missing URB. The ICMP ping data
of 0x74, 0x75, 0x76, 077 is from the next Ethernet frame meaning the end
of the current Ethernet frame is missing and the next frame has a
missing start of Ethernet frame.
Note that due to IPv4 fragmentation "consecutive" Ethernet frames will
contain payloads of 1500 (MTU size) octets typically followed by a short
Ethernet frame. The payloads are fragmented IP packets.
So I've been trying to add some print messages here to better
understand whats going on.
Again, I'm a bit new to this code, so forgive me for my lack of
understanding things. Since the remaining value seems to be key, I
tried to look around and figure out where it was being set. It seems
like its only set in this function, is that right? So this made me
guess something might be happening in a previous iteration that was
causing this to trigger.
I added some debug prints to every time we set the remaining value, or
modify it, as well as to print the value if we enter the fixup
function with a non-zero remaining value.
When we set the remaining value, usually its to 1514, when the skblen is 1518.
1514 is Ethernet header length + payload at MTU size of 1500.
skblen of 1518 is 32-bit header word + maximum Ethernet frame length
(for your Network).
However, right before we catch the problem, I see this:
I am guessing where your debug is located in the code so I may have
ms-interpreted your information.
[ 84.844337] JDB set remaining to 1514 (skblen: 1518)
This suggests 1 maximum length Ethernet frame of 1514 octets in the URB
buffer.
[ 84.844379] JDB set remaining to 1514 (skblen: 1518)
[ 84.844429] JDB set remaining to 1514 (skblen: 1518)
[ 84.844458] JDB set remaining to 1514 (skblen: 1518)
[ 84.844483] JDB set remaining to 1514 (skblen: 1518)
[ 84.844507] JDB set remaining to 1514 (skblen: 1518)
[ 84.844559] JDB set remaining to 1514 (skblen: 2048)
This URB probably has 2 Ethernet frames; 1 complete frame plus the start
of the next Ethernet frame.
I think 2048 could be the maximum URB transfer length for the USB bulk
transfer. 2048 seems to be a low value so should be investigated.
[ 84.844583] JDB set remaining to 1514 (skblen: 2048)
[ 84.844595] JDB: 1514 > 2048 - 1522
[ 84.844606] JDB: dropping remaining by 526
32-bit header word + Ethernet frame of 1514 is consumed leaving 2048 -
(4+1514) = 530 octets
So there is room for another 32-bit header word and the start of a new
Ethernet frame within those 530 octets.
Start of Ethernet frame is 530 - 4 = 526 octets of Ethernet frame than
spans into the next URB buffer.
The remaining value is 988 which means the total Ethernet frame length
was 526 + 988 = 1514 which is a maximum length Ethernet frame
[ 84.844624] asix_rx_fixup_internal() remaining: 988, skb->len: 2048
[ 84.844672] asix 1-5:1.0 eth1: asix_rx_fixup() Data Header
synchronisation was lost, remaining 988
There should be a 32-bit header word located after the Ethernet frame
after location 988 but no 32-bit header was there. This is consistent
with a missing URB (there could be other reasons).
However, the gap between the timestamps is higher than expected at 273us:
84.844945 - 84.844672 = 0.000273 = 273us
It could be that "Sync lost" and "Bad Header" are not from the same
instance of asix_rx_fixup_internal() eg. processing different URBs
Therefore, it could be concluded that synchronisation has been recovered
by the "sync lost" detector forcing to look at the start of the URB
buffer for the 32-bit header word and a 32-bit header was found as no
immediate "Bad Header length" message was indicated. This also suggests
a URB was lost.
[ 84.844945] asix 1-5:1.0 eth1: asix_rx_fixup() Bad Header Length
0xdd5f8f9b, offset 4
This time an URB containing the start of Ethernet frame is missing
because remaining is 0 (as no "sync lost" message"). This means that
there is no 32-bit header word at the start of the URB buffer meaning no
start of an Ethernet frame. This is also consistent with a dropped URB
which contained the start of the now missing Ethernet frame.
[ 84.845217] asix 1-5:1.0 eth1: asix_rx_fixup() Bad Header Length
0x3ce1ad3c, offset 4
[ 84.845451] JDB set remaining to 1514 (skblen: 1518)
[ 84.845485] JDB set remaining to 1514 (skblen: 1518)
[ 84.845511] JDB set remaining to 1514 (skblen: 1518)
[ 84.851003] JDB set remaining to 1514 (skblen: 1518)
So when the issue happens, it seems to be due to an larger then usual
skb (2048). The first time through the wile loop we set the remaining
to 1514, but offset is fairly small, so we set copy_length to 1514,
and clear remaining. The offset is bumped by a little more then the
copy length and we loop again. Then the second time through we set
remaining to 1514, but since offset is bigger now, the if
(rx->remaining > skb->len - offset) case is true..
This is where it feels a little strange..
It is because you have to wait for the next URB to get the remaining
part of the Ethernet frame which occurs in the near future. eg. Ethernet
frame spanned across 2 URB buffers.
We calculate the copy_length as the difference between the offset and
the skb->len (so how much is left in the skb, which is 526), then
decrement remaining by that amount. Not really sure what remaining
(now 988) is supposed to represent here. We copy the 526 bytes, and
then exit the loop.
As explained above. The end of the current URB buffer contains the first
526 octets of the Ethernet frame and we have to wait for the next URB to
get the next part of the Ethernet frame.
The Ethernet frame length is 1514 so the remaining value is 1514 - 526 =
988. So the next URB should contain the missing 988 octets.
Now the next time we are called, we enter and we have a remaining
value still of 988, which triggers the header synchronization error
path.
The sync is lost because this new URB buffer is not containing the
expected 988 octets because the 32-bit header word after the end of the
Ethernet frame appears to be missing. Therefore, probably detected a
dropped URB which had the 988 octets of the end of the current Ethernet
frame.
Now, I'm not sure if the remainder handling logic is sane, or if the
skb->len being 2048 is problematic, or what. The skb->lens can vary
in sizes, usually 1518 during high throughput, but I've seen 1588,
1640, and other larger numbers that don't trigger the same problem.
I expect the USB bulk transfer is limited to a maximum transfer size of
2048 octets eg. max URB buffer size. The ASIX driver code should be
checked to see where the 2048 limit comes from. I wonder whether 4096
could be used ?
If the USB to Ethernet adaptor has more data to send than can fit in
2048 byte buffer then the Ethernet frame needs to be split across 2 URBs.
In the case of 1640 buffer, it probably means 2 Ethernet frames are in
the buffer:
32-bit header word + 1514 Ethernet frame + 32-bit header word + ((1640 -
(4 + 1514 +4)) = 118) Ethernet frame
This could be due to IP fragmentation.
ie:
[ 106.946473] JDB set remaining to 1514 (skblen: 1518)
[ 106.946525] JDB set remaining to 1514 (skblen: 1640)
[ 106.946546] JDB set remaining to 118 (skblen: 1640)
Yes, that 118 confirms my reasoning above.
[ 106.946586] JDB set remaining to 1514 (skblen: 1518)
So yea.. maybe that will help clue things in a bit? I'm still a bit lost. :)
Your observations are consistent with missing URBs from the USB host
controller.
Here is a summary of what I think is happening in your case:
Good case:
URB #1: 1514 octets of 1514 Ethernet frame (A)
URB #2: 1514 octets of 1514 Ethernet frame (B) + 526 octets of 1514
Ethernet frame (C)
URB #3: 988 octets of 1514 Ethernet frame (C)
URB #4: 1514 octets of 1514 Ethernet frame (D)
Therefore, Ethernet frame (C) is spanning URBs #2 and #3.
Bad case, URB #3 is lost:
URB #1: 1514 octets of 1514 Ethernet frame (A)
URB #2: 1514 octets of 1514 Ethernet frame (B) + 526 octets of 1514
Ethernet frame (C)
Remaining is 988
URB #4: 1514 octets of 1514 Ethernet frame (D)
But when URB #4 is analysed the 32-bit Header word is not found after
988 octets in the URB buffer so "sync lost".
The end of Ethernet frame (C) is missing so drop the Ethernet frame.
Now look at the start of the URB #4 buffer and find a 32-bit header word
so Ethernet frame (D) can be consumed.
So I think the commit is acting as intended and you are suffering from
lost URBs.
So perhaps you have a URB memory allocation issue for the USB host
controller ? However, this would also mean x86 has the same issue. So
perhaps it is a bug in common USB kernel code ?
You might find it helpful to get USB analyser logs or use the software
utility usbmon to grab the USB host controller data. You might then be
able to compare the data from the USB log with the URB buffer data. This
might confirm that URBs are going missing.
On the other hand, it was suggested that removing my commit improved
through-put. Is that really true ? Perhaps you can try an experiment
where you add test code to add a counter to use my "sync lost" detector
for 1000 URBs then by-pass the "sync lost" detector for 1000 URBs and
repeat it cyclically. 1000 is a guess, you might need a higher value.
Then run some tests to see whether the through-put periodically goes up
and down in sympathy with the "sync lost" detector being used or not
being used. Also track the RX error counter from ifconfig to see how
many bad Ethernet frames got to the IP stack.
Also try using wireshark to see whether you are generating IP fragmented
frames. Unfortunately, using iperf with IPv6 with defaults will generate
IPv6 fragmentation causing twice as many Ethernet frames to be sent as
needed so is inefficient. To fix that, in iperf reduce the data length
so that the MTU size of 1500 is not exceeded.
The commit only runs when Ethernet frames span URBs which typically
occurs when IP fragmentation happens or when there is a very high rate
of Ethernet frames per second.
At least, it is now clear why you see "988 remaining", that is because
you are using maximum length Ethernet frames and 2 of those frames are
written into a single URB which means 988 octets of the 2nd Ethernet
frame goes into the 2nd URB buffer. So that makes sense and is correct.
Regards,
Dean
thanks
-john
--
Dean Jenkins
Embedded Software Engineer
Linux Transportation Solutions
Mentor Embedded Software Division
Mentor Graphics (UK) Ltd.