Re: e1000e driver - hang after 4 hours of uptime - finally bisected!
From: Jeff Kirsher
Date: Thu Jun 18 2015 - 18:34:36 EST
On Thu, 2015-06-18 at 12:46 -0400, Valdis Kletnieks wrote:
> (follow up to a report from last week - bisecting took a while as I could
> only do 1 or 2 tests an evening)
>
> My Dell Latitude E6530 crashes with a specific kernel lockup almost
> exactly 4 hours after boot if there isn't a cable connected to the
> Ethernet port:
>
> [14508.846327] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14468.229720] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14463.254791] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14491.134413] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14463.396593] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
> [14490.390223] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14494.680591] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14513.365378] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14482.271716] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> [14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
>
> As far as I can tell, the timestamp jitter is just how long it takes me to
> enter the cryptLUKS passphrase for the hard drive at boot...
>
> lspci tells me:
>
> lspci -vvv -s "00:19.0"
> 00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
> DeviceName: Onboard LAN
> Subsystem: Dell Device 0535
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Interrupt: pin A routed to IRQ 28
> Region 0: Memory at f7700000 (32-bit, non-prefetchable) [size=128K]
> Region 1: Memory at f7739000 (32-bit, non-prefetchable) [size=4K]
> Region 2: I/O ports at f040 [size=32]
> Capabilities: [c8] Power Management version 2
> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Address: 00000000fee00318 Data: 0000
> Capabilities: [e0] PCI Advanced Features
> AFCap: TP+ FLR+
> AFCtrl: FLR-
> AFStatus: TP-
> Kernel driver in use: e1000e
>
>
> The traceback always looks like:
>
> [14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
>
> [14479.906908] Call Trace:
> [14479.906914] <NMI> [<ffffffffba94db16>] dump_stack+0x50/0xa8
> [14479.906930] [<ffffffffba948bb9>] panic+0xcd/0x1e4
> [14479.906940] [<ffffffffba166a60>] ? perf_event_task_disable+0xc0/0xc0
> [14479.906952] [<ffffffffba125d8b>] watchdog_overflow_callback+0x9b/0xa0
> [14479.906959] [<ffffffffba16a684>] __perf_event_overflow+0xc4/0x1f0
> [14479.906968] [<ffffffffba16b3a4>] perf_event_overflow+0x14/0x20
> [14479.906976] [<ffffffffba022271>] intel_pmu_handle_irq+0x1e1/0x430
> [14479.906990] [<ffffffffba01a0f6>] perf_event_nmi_handler+0x26/0x40
> [14479.906999] [<ffffffffba0085b3>] nmi_handle+0x103/0x340
> [14479.907005] [<ffffffffba0084b5>] ? nmi_handle+0x5/0x340
> [14479.907017] [<ffffffffba008a53>] default_do_nmi+0xc3/0x120
> [14479.907032] [<ffffffffba008b98>] do_nmi+0xe8/0x130
> [14479.907044] [<ffffffffba95c9a8>] end_repeat_nmi+0x1e/0x2e
> [14479.907055] [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907061] [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907069] [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907075] <<EOE>> [<ffffffffba0e9529>] timecounter_read+0x19/0x60
> [14479.907088] [<ffffffffba53687e>] e1000e_phc_gettime+0x2e/0x60
> [14479.907098] [<ffffffffba536a31>] e1000e_systim_overflow_work+0x31/0x70
> [14479.907105] [<ffffffffba07ad19>] process_one_work+0x3c9/0x980
> [14479.907115] [<ffffffffba07ac62>] ? process_one_work+0x312/0x980
> [14479.907125] [<ffffffffba07b348>] ? worker_thread+0x78/0x760
> [14479.907134] [<ffffffffba07b59c>] worker_thread+0x2cc/0x760
> [14479.907144] [<ffffffffba07b2d0>] ? process_one_work+0x980/0x980
> [14479.907154] [<ffffffffba082a5e>] kthread+0xfe/0x120
> [14479.907163] [<ffffffffba08ca50>] ? finish_task_switch+0x50/0x1c0
> [14479.907173] [<ffffffffba082960>] ? kthread_create_on_node+0x270/0x270
> [14479.907179] [<ffffffffba95ae4f>] ret_from_fork+0x3f/0x70
> [14479.907188] [<ffffffffba082960>] ? kthread_create_on_node+0x270/0x270
> [14479.907243] Kernel Offset: 0x39000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> Bisection tells me it's this commit:
>
> commit 83129b37ef35bb6a7f01c060129736a8db5d31c4
> Author: Yanir Lubetkin <yanirx.lubetkin@xxxxxxxxx>
> Date: Tue Jun 2 17:05:45 2015 +0300
>
> e1000e: fix systim issues
>
> Two issues involving systim were reported.
> 1. Clock is not running in the correct frequency
> 2. In some situations, systim values were not incremented linearly
> This patch fixes the hardware clock configuration and the spurious
> non-linear increment.
Thanks Valdis! I will have Yanir look into it and hopefully we should
have a fix here soon for you to verify.
Attachment:
signature.asc
Description: This is a digitally signed message part