Re: r8169 hang on 4.18

From: Heiner Kallweit
Date: Mon Sep 24 2018 - 16:21:39 EST


On 24.09.2018 14:00, Ortwin Glück wrote:
> Hi,
>
> Stable kernel has stability problems on r8169 that were not present in 4.17.3:
>
> [    0.000000] Linux version 4.18.8 (kbuild@lofw) (gcc version 7.3.0 (Gentoo 7.3.0-r3 p1.4)) #70 SMP PREEMPT Mon Sep 17 17:56:57 CEST 2018
> [    0.000000] Command line: BOOT_IMAGE=/boot/linux-4.18.8 root=LABEL=ROOT ro rootfstype=ext4 net.ifnames=0 pci=nomsi
>
> [    1.772849] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
> [    1.772852] r8169 0000:07:00.0: can't disable ASPM; OS doesn't have ASPM control
> [    1.784948] r8169 0000:07:00.0 eth4: RTL8168h/8111h, 50:9a:4c:2e:92:be, XID 54100800, IRQ 16
> [    1.784949] r8169 0000:07:00.0 eth4: jumbo features [frames: 9200 bytes, tx checksumming: ko]
>
> We saw the interface unresponsive twice during the last 3 days with:
>
> [Mon Sep 24 11:35:56 2018] ------------[ cut here ]------------
> [Mon Sep 24 11:35:56 2018] NETDEV WATCHDOG: wan (r8169): transmit queue 0 timed out
> [Mon Sep 24 11:35:56 2018] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x215/0x220
> [Mon Sep 24 11:35:56 2018] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.18.8 #70
> [Mon Sep 24 11:35:56 2018] Hardware name: Dell Inc. OptiPlex 3050/0W0CHX, BIOS 1.6.5 09/09/2017
> [Mon Sep 24 11:35:56 2018] RIP: 0010:dev_watchdog+0x215/0x220
> [Mon Sep 24 11:35:56 2018] Code: 49 63 4c 24 e8 eb 8c 4c 89 ef c6 05 1a 19 ca 00 01 e8 5f 52 fd ff 89 d9 4c 89 ee 48 c7 c7 78 ab 67 89 48 89 c2 e8 1b 2b 49 ff <0f> 0b eb be 0f 1f 80 00 00 00 00 41 57 45 89 cf 41 56 49 89 d6 41
> [Mon Sep 24 11:35:56 2018] RSP: 0018:ffff96f05dd03e98 EFLAGS: 00010282
> [Mon Sep 24 11:35:56 2018] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
> [Mon Sep 24 11:35:56 2018] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff96f05dd15350
> [Mon Sep 24 11:35:56 2018] RBP: ffff96f0462ee41c R08: 0000000000000001 R09: 000000000000133d
> [Mon Sep 24 11:35:56 2018] R10: 0000000000000202 R11: 0000000000000000 R12: ffff96f0462ee438
> [Mon Sep 24 11:35:56 2018] R13: ffff96f0462ee000 R14: 0000000000000001 R15: ffff96f0455eaa80
> [Mon Sep 24 11:35:56 2018] FS:  0000000000000000(0000) GS:ffff96f05dd00000(0000) knlGS:0000000000000000
> [Mon Sep 24 11:35:56 2018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Mon Sep 24 11:35:56 2018] CR2: 000055c9498766e0 CR3: 00000000bb80a006 CR4: 00000000003606e0
> [Mon Sep 24 11:35:56 2018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [Mon Sep 24 11:35:56 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [Mon Sep 24 11:35:56 2018] Call Trace:
> [Mon Sep 24 11:35:56 2018]  <IRQ>
> [Mon Sep 24 11:35:56 2018]  ? pfifo_fast_reset+0x130/0x130
> [Mon Sep 24 11:35:56 2018]  ? pfifo_fast_reset+0x130/0x130
> [Mon Sep 24 11:35:56 2018]  call_timer_fn+0x11/0x70
> [Mon Sep 24 11:35:56 2018]  expire_timers+0x8e/0xa0
> [Mon Sep 24 11:35:56 2018]  run_timer_softirq+0xb9/0x160
> [Mon Sep 24 11:35:56 2018]  ? __hrtimer_run_queues+0x135/0x1a0
> [Mon Sep 24 11:35:56 2018]  ? hw_breakpoint_pmu_read+0x10/0x10
> [Mon Sep 24 11:35:56 2018]  ? ktime_get+0x39/0x90
> [Mon Sep 24 11:35:56 2018]  ? lapic_next_event+0x20/0x20
> [Mon Sep 24 11:35:56 2018]  __do_softirq+0xcb/0x1f8
> [Mon Sep 24 11:35:56 2018]  irq_exit+0xa9/0xb0
> [Mon Sep 24 11:35:56 2018]  smp_apic_timer_interrupt+0x59/0x90
> [Mon Sep 24 11:35:56 2018]  apic_timer_interrupt+0xf/0x20
> [Mon Sep 24 11:35:56 2018]  </IRQ>
> [Mon Sep 24 11:35:56 2018] RIP: 0010:cpuidle_enter_state+0x129/0x200
> [Mon Sep 24 11:35:56 2018] Code: 45 00 89 c3 e8 d8 3b 55 ff 65 8b 3d b1 eb 45 77 e8 8c 3a 55 ff 31 ff 49 89 c4 e8 72 43 55 ff fb 48 ba cf f7 53 e3 a5 9b c4 20 <4c> 89 e1 4c 29 e9 48 89 c8 48 c1 f9 3f 48 f7 ea b8 ff ff ff 7f 48
> [Mon Sep 24 11:35:56 2018] RSP: 0018:ffff9a93c06e7ea8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
> [Mon Sep 24 11:35:56 2018] RAX: ffff96f05dd1f800 RBX: 0000000000000003 RCX: 000000000000001f
> [Mon Sep 24 11:35:56 2018] RDX: 20c49ba5e353f7cf RSI: 00000000258f0602 RDI: 0000000000000000
> [Mon Sep 24 11:35:56 2018] RBP: ffff96f05dd25ee0 R08: 00000000000002b4 R09: 00000000ffffffff
> [Mon Sep 24 11:35:56 2018] R10: ffff9a93c06e7e90 R11: 0000000000000142 R12: 00012ec849a182b9
> [Mon Sep 24 11:35:56 2018] R13: 00012ec8499ddf88 R14: 0000000000000003 R15: 0000000000000000
> [Mon Sep 24 11:35:56 2018]  ? cpuidle_enter_state+0x11e/0x200
> [Mon Sep 24 11:35:56 2018]  do_idle+0x1c0/0x200
> [Mon Sep 24 11:35:56 2018]  cpu_startup_entry+0x6a/0x70
> [Mon Sep 24 11:35:56 2018]  start_secondary+0x18a/0x1c0
> [Mon Sep 24 11:35:56 2018]  secondary_startup_64+0xa5/0xb0
> [Mon Sep 24 11:35:56 2018] ---[ end trace 327bd9c035abe307 ]---
>
> This is the built-in ethernet port on a Dell main board:
> 07:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
>         Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [1028:07a3]
>         Flags: bus master, fast devsel, latency 0, IRQ 16
>         I/O ports at e000 [size=256]
>         Memory at f7404000 (64-bit, non-prefetchable) [size=4K]
>         Memory at f7400000 (64-bit, non-prefetchable) [size=16K]
>         Capabilities: [40] Power Management version 3
>         Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>         Capabilities: [70] Express Endpoint, MSI 01
>         Capabilities: [b0] MSI-X: Enable- Count=4 Masked-
>         Capabilities: [100] Advanced Error Reporting
>         Capabilities: [140] Virtual Channel
>         Capabilities: [160] Device Serial Number 01-00-00-00-68-4c-e0-00
>         Capabilities: [170] Latency Tolerance Reporting
>         Capabilities: [178] L1 PM Substates
>         Kernel driver in use: r8169
>
> The box has an extra 4-way ethernet card that uses the same driver. We had to set pci=nomsi because the card frequently behaved erratic with msi on.
>
> Thanks,
>
> Ortwin
>
Thanks for the report. I have a few questions:

You say the box has one on-board network port and four network ports on
an extension card, all five driven by r8169. The on-board chip is an
RTL8168h. What type of chips does the extension card use?
I'm asking because r8169 supports about 50 chip variants of the
RTL8169/8168 family.
Are the problems the same on all five ports?
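
To collect the chip type of every port in one go, the r8169 probe lines in
dmesg can be parsed. The following is only a minimal sketch: the pattern is
modelled on the single probe line quoted above, and the names PROBE_RE and
list_r8169_ports are made up for illustration. Other chip variants or kernel
versions may print the line differently.

```python
import re

# Pattern modelled on the probe line from the report, e.g.
# "r8169 0000:07:00.0 eth4: RTL8168h/8111h, 50:9a:4c:2e:92:be, XID 54100800, IRQ 16"
# Assumption: every port is announced in this format.
PROBE_RE = re.compile(
    r"r8169 (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<ifname>\S+): (?P<chip>[^,]+), "
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}), "
    r"XID (?P<xid>[0-9a-f]+), IRQ (?P<irq>\d+)"
)


def list_r8169_ports(dmesg_text):
    """Return one dict (pci, ifname, chip, mac, xid, irq) per probe line."""
    return [m.groupdict() for m in PROBE_RE.finditer(dmesg_text)]
```

Feeding it the saved output of dmesg and printing the "chip" and "xid" fields
per "pci" address should show at a glance whether the extension card uses the
same chip variant as the on-board port.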

Can you reproduce the problem (how)? Any specific network usage
triggering the problem?
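
To see which ports are affected and when, the watchdog messages can be pulled
out of a saved kernel log. This is a rough sketch, assuming the log was
captured with human-readable timestamps in the same format as the trace quoted
above; WATCHDOG_RE, watchdog_events and timeouts_per_interface are
hypothetical names.

```python
import re
from collections import Counter

# Modelled on the line from the report:
# "[Mon Sep 24 11:35:56 2018] NETDEV WATCHDOG: wan (r8169): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"^\[(?P<stamp>[^\]]+)\] NETDEV WATCHDOG: (?P<ifname>\S+) "
    r"\((?P<driver>[^)]+)\): transmit queue (?P<queue>\d+) timed out",
    re.MULTILINE,
)


def watchdog_events(log_text):
    """Return (timestamp, interface, driver, queue) for each TX timeout."""
    return [
        (m["stamp"].strip(), m["ifname"], m["driver"], int(m["queue"]))
        for m in WATCHDOG_RE.finditer(log_text)
    ]


def timeouts_per_interface(log_text):
    """Count TX timeouts per interface name."""
    return Counter(ifname for _, ifname, _, _ in watchdog_events(log_text))
```

Comparing the timestamps with what the machine was doing at the time may hint
at a traffic pattern that triggers the hang.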

The root cause is not necessarily in r8169; some other change could
have broken it too. Can you test r8169 from 4.18 on top of 4.17?

When you say the card "behaved erratic", do you mean the network hangs
described above, or some other issue?

A similar report is here:
https://bugzilla.kernel.org/show_bug.cgi?id=201109
There the problem seems to start with the upgrade from 4.18.4 to 4.18.5.
Can you try 4.18.4?

The diff between 4.18.4 and 4.18.5 shows nothing related to r8169.
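
Anyone who wants to repeat that check can list the commits touching the
driver directory between the two stable tags. A minimal sketch, assuming a
local clone of the stable tree that contains the tags v4.18.4 and v4.18.5;
the helper names driver_log_cmd and driver_commits are made up here.

```python
import subprocess


def driver_log_cmd(old_tag, new_tag, path="drivers/net/ethernet/realtek"):
    """Build the git command listing commits under path between two tags."""
    return ["git", "log", "--oneline", f"{old_tag}..{new_tag}", "--", path]


def driver_commits(repo_dir, old_tag, new_tag):
    """Run the command in repo_dir (assumed to be a kernel git checkout)."""
    result = subprocess.run(
        driver_log_cmd(old_tag, new_tag),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()
```

Calling driver_commits with the path to the checkout, "v4.18.4" and "v4.18.5"
returns one line per matching commit; an empty list means nothing under that
directory changed between the two releases.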

Rgds, Heiner