Re: [REGRESSION] asix: Lots of asix_rx_fixup() errors and slow transmissions
From: Dean Jenkins
Date: Tue May 03 2016 - 17:16:51 EST
On 03/05/16 15:42, David B. Robins wrote:
I don't think the first one is giving you problems (except as
triggered by the second) but I had concerns about the second myself
(and emailed the author off-list, but received no reply), and we did
not take that commit for our own product.
Sorry, I might have missed your original E-mail.
Specifically, the second change, 3f30... (original patch:
https://www.mail-archive.com/netdev@xxxxxxxxxxxxxxx/msg80720.html) (1)
appears to do the exact opposite of what it claims, i.e., instead of
"resync if this looks like a header", it does "resync if this does NOT
look like a (packet) header", where "looks like a header" means "bits
0-10 (size) are equal to the bitwise-NOT of bits 16-26", and (2) can
happen by coincidence for 1/2048 32-bit values starting a continuation
URB (easy to hit dealing with large volumes of video data as we were).
It appears to expect the header for every URB whereas the rest of the
code at least expects it only once per network packet (look at
following code that only reads it for remaining == 0).
David, I think that your interpretation is incorrect. Please see below.
Here is the code snippet from the patch with my annotations between # #,
I will try to explain my intentions. Feel free to point out any flaws:
if (rx->remaining && (rx->remaining + sizeof(u32) <= skb->len)) {
# Only runs when rx->remaining !=0 and the end of the Ethernet
frame + next 32-bit header word is within the URB buffer. #
# Therefore, this code does not run when the end of an Ethernet
frame has been reached in the previous URB #
# or when the end of the Ethernet frame + next 32-bit header
word will be in a later URB buffer #
# offset is an index to the expected next 32-bit header word
after the end of the Ethernet frame #
offset = ((rx->remaining + 1) & 0xfffe) + sizeof(u32);
# rx->header contains the expected 32-bit header value
corrected for Endianness and alignment #
rx->header = get_unaligned_le32(skb->data + offset);
offset = 0;
# check the data integrity of the size value from the header word #
size = (u16)(rx->header & 0x7ff);
# if the size value fails the integrity check then we are not
looking at a valid header word so #
# synchronisation has been lost #
if (size != ((~rx->header >> 16) & 0x7ff)) {
netdev_err(dev->net, "asix_rx_fixup() Data Header
synchronisation was lost, remaining %d\n",
rx->remaining);
if (rx->ax_skb) {
kfree_skb(rx->ax_skb);
rx->ax_skb = NULL;
/* Discard the incomplete netdev Ethernet frame
* and assume the Data header is at the start of
* the current URB socket buffer.
*/
}
rx->remaining = 0;
}
}
So that change made no sense to me, but I don't have significant
kernel dev experience. Effectively it will drop/truncate every
(2047/2048) split (longer than an URB) packet, and report an error for
the second URB and then again for treating said second URB as a first
URB for a packet. I would expect your problems will go away just
removing the second change. You could also change the != to == in "if
(size != ...)" but then you'd still have 1/2048 (depending on data
patterns) false positives.
The code only runs when the Ethernet frame spans across URBs and is
checking that the next 32-bit header word is present and valid.
Upon loss of synchronisation, the strategy is to assume that the 32-bit
header is at the start of the URB buffer. Obviously, that might not be
true every time but it is the most likely location especially when
Ethernet frames are not spanning URBs at that point at time.
Looking at the error messages:
[ 239.037310] asix 1-1.1:1.0 eth0: asix_rx_fixup() Bad Header Length
0x54ebb5ec, offset 4
The offset 4 means that the 32-bit header word was invalid at the start
of the URB buffer. This could be a consequence of data synchronisation
being lost however, we would expect the timestamps between the error
messages of "synchronisation was lost" and "Bad Header Length" to very
close as they would be consecutive URBs. The evidence is showing 10ms
gaps which does not suggest consecutive URBs. In other words, an
Ethernet frame should not be spanned over a time gap of 10ms as that
would be very inefficient. If that were true then there would be USB
communications problem with the USB to Ethernet adaptor which I hope is
not true.
[ 239.027993] asix 1-1.1:1.0 eth0: asix_rx_fixup() Data Header
synchronisation was lost, remaining 988
This error message consistently shows the remaining value to be 988, at
least for the 3 examples provided by John. This does not suggest a
random failure unless there are other examples of a non 988 remaining
value error message. 988 is well within a Ethernet frame length so seems
to be valid.
I think a good step would be to add some debug to print the
rx->remaining value at entry to asix_rx_fixup_internal(). This would
generate a lot of debug but a pattern of the values might emerge.
A good test would be to run "ping -c 1 -s $packet_length $ip_address"
inside a script which has a loop with an increasing payload length
$packet_length with a small delay between ping calls. This will show
whether particular packet sizes trigger the failures.
Then try with "ping -f -c 200 -s $packet_length $ip_address" to load up
the USB link.
Seems that I need kernel v4.4 or later to get a kernel with my patch in.
This will take me a few days to find time to rig something up to test...
Regards,
Dean
--
Dean Jenkins
Embedded Software Engineer
Linux Transportation Solutions
Mentor Embedded Software Division
Mentor Graphics (UK) Ltd.