Re: r8169 hang on 4.18

From: Ortwin GlÃck
Date: Tue Sep 25 2018 - 05:53:42 EST


On 24.09.18 22:21, Heiner Kallweit wrote:
Thanks for the report. Here come a few inquiries:

You say the box has one on-board network port and four network ports on
an extension card, all five driven by r8169. The on-board chip is a
RTL8168h, what's the type of the chips on the extension card?

This is the card:
03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)
Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:0123]
Flags: bus master, fast devsel, latency 0, IRQ 17
I/O ports at d000 [size=256]
Memory at f7300000 (64-bit, non-prefetchable) [size=4K]
Memory at f0300000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Capabilities: [d0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
Capabilities: [170] Latency Tolerance Reporting
Kernel driver in use: r8169

04:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)
Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:0123]
Flags: bus master, fast devsel, latency 0, IRQ 18
I/O ports at c000 [size=256]
Memory at f7200000 (64-bit, non-prefetchable) [size=4K]
Memory at f0200000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Capabilities: [d0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
Capabilities: [170] Latency Tolerance Reporting
Kernel driver in use: r8169

05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)
Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:0123]
Flags: bus master, fast devsel, latency 0, IRQ 19
I/O ports at b000 [size=256]
Memory at f7100000 (64-bit, non-prefetchable) [size=4K]
Memory at f0100000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Capabilities: [d0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
Capabilities: [170] Latency Tolerance Reporting
Kernel driver in use: r8169

06:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)
Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:0123]
Flags: bus master, fast devsel, latency 0, IRQ 16
I/O ports at a000 [size=256]
Memory at f7000000 (64-bit, non-prefetchable) [size=4K]
Memory at f0000000 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 01
Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
Capabilities: [d0] Vital Product Data
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Virtual Channel
Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
Capabilities: [170] Latency Tolerance Reporting
Kernel driver in use: r8169



I'm asking because r8169 supports ~ 50 chip variants of the RTL8169/8
family.
Are the problems the same on all five ports?

Yes, I see the WARN_ON also on the other ports. Here is one from the separate card:
[Mon Sep 24 15:47:21 2018] ------------[ cut here ]------------
[Mon Sep 24 15:47:21 2018] NETDEV WATCHDOG: lan (r8169): transmit queue 0 timed out
[Mon Sep 24 15:47:21 2018] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x215/0x220
[Mon Sep 24 15:47:21 2018] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.18.8 #70
[Mon Sep 24 15:47:21 2018] Hardware name: Dell Inc. OptiPlex 3050/0W0CHX, BIOS 1.6.5 09/09/2017
[Mon Sep 24 15:47:21 2018] RIP: 0010:dev_watchdog+0x215/0x220
[Mon Sep 24 15:47:21 2018] Code: 49 63 4c 24 e8 eb 8c 4c 89 ef c6 05 1a 19 ca 00 01 e8 5f 52 fd ff 89 d9 4c 89 ee 48 c7 c7 78 ab 67 af 48 89 c2 e8 1b 2b 49 ff <0f> 0b eb be 0f 1f 80 00 00 00 00 41 57 45 89 cf 41 56 49 89 d6 41
[Mon Sep 24 15:47:21 2018] RSP: 0018:ffff8a205dd03e98 EFLAGS: 00010282
[Mon Sep 24 15:47:21 2018] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[Mon Sep 24 15:47:21 2018] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8a205dd15350
[Mon Sep 24 15:47:21 2018] RBP: ffff8a2045de041c R08: 0000000000000001 R09: 0000000000000763
[Mon Sep 24 15:47:21 2018] R10: 0000000000000082 R11: 0000000000000000 R12: ffff8a2045de0438
[Mon Sep 24 15:47:21 2018] R13: ffff8a2045de0000 R14: 0000000000000001 R15: ffff8a20455e1480
[Mon Sep 24 15:47:21 2018] FS: 0000000000000000(0000) GS:ffff8a205dd00000(0000) knlGS:0000000000000000
[Mon Sep 24 15:47:21 2018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Sep 24 15:47:21 2018] CR2: 00007ff598642c38 CR3: 000000001f80a004 CR4: 00000000003606e0
[Mon Sep 24 15:47:21 2018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Mon Sep 24 15:47:21 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Mon Sep 24 15:47:21 2018] Call Trace:
[Mon Sep 24 15:47:21 2018] <IRQ>
[Mon Sep 24 15:47:21 2018] ? pfifo_fast_reset+0x130/0x130
[Mon Sep 24 15:47:21 2018] ? pfifo_fast_reset+0x130/0x130
[Mon Sep 24 15:47:21 2018] call_timer_fn+0x11/0x70
[Mon Sep 24 15:47:21 2018] expire_timers+0x8e/0xa0
[Mon Sep 24 15:47:21 2018] run_timer_softirq+0xb9/0x160
[Mon Sep 24 15:47:21 2018] ? __hrtimer_run_queues+0x135/0x1a0
[Mon Sep 24 15:47:21 2018] ? hw_breakpoint_pmu_read+0x10/0x10
[Mon Sep 24 15:47:21 2018] ? ktime_get+0x39/0x90
[Mon Sep 24 15:47:21 2018] ? lapic_next_event+0x20/0x20
[Mon Sep 24 15:47:21 2018] __do_softirq+0xcb/0x1f8
[Mon Sep 24 15:47:21 2018] irq_exit+0xa9/0xb0
[Mon Sep 24 15:47:21 2018] smp_apic_timer_interrupt+0x59/0x90
[Mon Sep 24 15:47:21 2018] apic_timer_interrupt+0xf/0x20
[Mon Sep 24 15:47:21 2018] </IRQ>
[Mon Sep 24 15:47:21 2018] RIP: 0010:cpuidle_enter_state+0x129/0x200
[Mon Sep 24 15:47:21 2018] Code: 45 00 89 c3 e8 d8 3b 55 ff 65 8b 3d b1 eb 45 51 e8 8c 3a 55 ff 31 ff 49 89 c4 e8 72 43 55 ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
[Mon Sep 24 15:47:21 2018] RSP: 0018:ffff9f46806e7ea8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[Mon Sep 24 15:47:21 2018] RAX: ffff8a205dd1f800 RBX: 0000000000000004 RCX: 000000000000001f
[Mon Sep 24 15:47:21 2018] RDX: 20c49ba5e353f7cf RSI: 00000000258f0602 RDI: 0000000000000000
[Mon Sep 24 15:47:21 2018] RBP: ffff8a205dd25ee0 R08: 00000000000003ee R09: 00000000ffffffff
[Mon Sep 24 15:47:21 2018] R10: ffff9f46806e7e90 R11: 00000000000003d5 R12: 00000752028ec2d0
[Mon Sep 24 15:47:21 2018] R13: 000007520282ea18 R14: 0000000000000004 R15: 0000000000000000
[Mon Sep 24 15:47:21 2018] ? cpuidle_enter_state+0x11e/0x200
[Mon Sep 24 15:47:21 2018] do_idle+0x1c0/0x200
[Mon Sep 24 15:47:21 2018] cpu_startup_entry+0x6a/0x70
[Mon Sep 24 15:47:21 2018] start_secondary+0x18a/0x1c0
[Mon Sep 24 15:47:21 2018] secondary_startup_64+0xa5/0xb0
[Mon Sep 24 15:47:21 2018] ---[ end trace 7ea59db719c41930 ]---
[Mon Sep 24 15:47:21 2018] r8169 0000:06:00.0 lan: link up


Can you reproduce the problem (how)? Any specific network usage
triggering the problem?

Actually the WARN_ON seems not to be causing the hang but to be rather a side effect of an "unstable" port. The WARN_ON occurs every few days but the port works fine most of the time. But when the port hangs (no more traffic) only a reboot fixes it.

These looks suspicious:

[ 8048.829832] r8169 0000:06:00.0 lan: link up
[64114.535455] r8169 0000:07:00.0 wan: link up
[74726.396319] r8169 0000:07:00.0 wan: link up

The ports seem to go up at random times (without "link down" message). This is also what we saw with MSI enabled: links thrashing in down/up cycles.

The root cause of the problem not necessarily is in r8169, some other
change could have broken it too. Can you test using r8169 from 4.18
on top of 4.17?

Unfortunately I am not able to do that.

When stating "behaves erratic" you refer to the network hangs
mentioned before? Or to some other issue?

please see above: links thrashing in down/up cycles.
I see that MSI/MSI-X has been disabled on quite a few chip versions (e.g. lately commit 72a579b8).
Does pci=nomsi disable MSI-X as well, or just MSI?
I could try adding my chip versions to these quirk sections. I couldn't find a way to disable MSI/MSI-X per device.

A similar report is here:
https://bugzilla.kernel.org/show_bug.cgi?id=201109

Also that report has the "link up" message after the WARN_ON.

There the problem seems to start with the upgrade from 4.18.4 to 4.18.5.
Can you try with 4.18.4 ?

The diff between 4.18.4 and 4.18.5 shows nothing related to r8169.

Exactly. I think the incident is just so rare that correlating it with a specific upgrade is hard.

Thanks,

Ortwin