Re: r8169 regression: UDP packets dropped intermittently

From: Jonathan Woithe
Date: Tue Dec 19 2017 - 00:45:48 EST


Hi again

This is a follow-up to my earlier message.

On Tue, Dec 19, 2017 at 09:02:25AM +1030, Jonathan Woithe wrote:
> On Mon, Dec 18, 2017 at 02:38:53PM +0100, Holger Hoffstätte wrote:
> > Since I've seen your postings several times now with no comment or resolution
> > I've decided to try your reproducer on my own systems. In short, I cannot
> > reproduce any packet loss, despite having 2 (cheap) 1Gb switches between the
> > two machines. Both are running 4.14.7.
>
> Thanks for trying the test program on your system. The result indicates
> that the problem might be specific to the behaviour of a particular network
> variant of the r8169 chip.

I was able to temporarily acquire a PCIe card which uses the r8169 driver.
This allowed me to run the reproducer on the same machine with two different
r8169-based cards. The original NIC is this:

05:01.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169
Gigabit Ethernet (rev 10) [10ec:8169]
Subsystem: Netgear GA311 [1385:311a]

The PCIe card is this:

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B
PCI Express Gigabit Ethernet controller (rev 06) [10ec:8168]
Subsystem: Realtek Semiconductor Co., Ltd. Device 0123 [10ec:0123]

The test was conducted with kernel 4.3.0 since both the 4.3.0 driver (which
triggers the fault) and the forward ported driver (which predates commit
da78dbff2e05630921c551dbbc70a4b7981a8fff) were available. For the record,
the machine used as the slave in these tests (the one receiving the 6 byte
request and sending the 14 byte response) was using its onboard NIC:

00:19.0 Ethernet controller: Intel Corporation 82579V Gigabit Network
Connection (rev 05) [8086:1503]
Subsystem: Gigabyte Technology Co., Ltd 82579V Gigabit Network
Connection [1458:e000]
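
For anyone who does not have the test program to hand, the request/response
pattern described above (a 6 byte request answered by a 14 byte response)
can be sketched roughly as follows. This is not the actual reproducer, only
an illustrative Python sketch: the function names, the echo-style payload,
the 0.5 second timeout and the 64 byte receive buffer are all assumptions
made for the example.

```python
import socket
import threading
import time

REQUEST_LEN = 6    # request size in bytes, as described in the text
RESPONSE_LEN = 14  # response size in bytes, as described in the text


def run_slave(sock, stop):
    """Answer each 6-byte request with a 14-byte response until stop is set.

    The payload layout (request echoed back, zero padded) is an assumption.
    """
    sock.settimeout(0.2)  # wake up regularly to check the stop flag
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(64)
        except socket.timeout:
            continue
        if len(data) == REQUEST_LEN:
            sock.sendto(data.ljust(RESPONSE_LEN, b"\0"), addr)


def run_master(slave_addr, count, timeout=0.5):
    """Send count requests one at a time; return the number left unanswered."""
    lost = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for seq in range(count):
            request = seq.to_bytes(REQUEST_LEN, "big")
            sock.sendto(request, slave_addr)
            deadline = time.monotonic() + timeout
            while True:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    lost += 1
                    break
                sock.settimeout(remaining)
                try:
                    reply, _ = sock.recvfrom(64)
                except socket.timeout:
                    lost += 1
                    break
                # Accept only the reply matching this request; a late reply
                # to an earlier request is discarded and we keep waiting.
                if (len(reply) == RESPONSE_LEN
                        and reply[:REQUEST_LEN] == request):
                    break
    return lost
```

In a sketch like this, run_slave() would run on the slave machine with a
socket bound to a known port, and run_master() would run on the machine with
the NIC under test. Any non-zero return value from run_master() counts as a
test failure.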

Test outcomes were as follows:

PCIe card, unpatched 4.3.0 r8169 driver: no error (tested for 1 hour)
PCIe card, forward ported r8169 driver: no error (tested for 1 hour)

GA311 card, unpatched 4.3.0 r8169 driver: test fail in under 4 minutes
GA311 card, forward ported r8169 driver: no error (tested for 1 hour)

For completeness, I then booted 4.14 and repeated the test with its r8169
driver. The PCIe card ran for an hour without triggering the error, while
the GA311 triggered it quickly (in under 3 minutes).

This clearly indicates that not every card using the r8169 driver is
vulnerable to the problem. It also explains why Holger was unable to
reproduce the result on his system: the PCIe cards do not appear to suffer
from the problem. Most likely the PCI RTL-8169 chip is affected while newer
PCIe variants are not. However, more testing with a wider variety of cards
will be required if this inference is to hold up.

The above results (and those from Holger) allow the problem description to be
refined a little: changes in commit da78dbff2e05630921c551dbbc70a4b7981a8fff
cause GA311 NICs (and possibly other PCI cards using an RTL-8169) to have
trouble with small UDP packets, while PCIe variants are seemingly
unaffected.

Does this help?

Regards
jonathan