Using coherent buffers (non-cacheable, allocated with usb_alloc_coherent),
I can get it to fail extremely regularly by simply reducing the buffer size
(agg_buf_sz) from 16KB down to 4KB. This makes reproducing the issue
much much easier -- the same problems do happen with the larger 16KB size,
but much less often than with smaller sizes.
So.. with a 4KB URB transfer_buffer size, along with a ton of added error-checking,
I see this behaviour every 10 (rx) URBs or so:
First URB (number 593):
[ 34.260667] r8152_rx_bottom: 593 corrupted urb: head=bf014000 urb_offset=2856/4096 pkt_len(1518) exceeds remainder(1216)
[ 34.271931] r8152_dump_rx_desc: 044805ee 40080000 006005dc 06020000 00000000 00000000 rx_len=1518
Next URB (number 594):
[ 34.281172] r8152_check_rx_desc: rx_desc looks bad.
[ 34.286228] r8152_rx_bottom: 594 corrupted urb. head=bf018000 urb_offset=0/304 len_used=24
[ 34.294774] r8152_dump_rx_desc: 00008300 00008400 00008500 00008600 00008700 00008800 rx_len=768
What the above sample shows, is the URB transfer buffer ran out of space in the middle
of a packet, and the hardware then tried to just continue that same packet in the next URB,
without an rx_desc header inserted. The r8152.c driver always assumes the URB buffer begins
with an rx_desc, so of course this behaviour produces really weird effects, and system crashes, etc..
So until that driver bug is addressed, I would advise disabling hardware RX checksums
for all chip versions, not only for version 02.
It is not clear to me how the chip decides when to forward an rx URB to the host.
If you could describe how that part works for us, then it would help in further
understanding why fast systems (eg. a PC) don't generally notice the issue,
while much slower embedded systems do see the issue regularly.