RE: 3.19: ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout

From: Tantilov, Emil S
Date: Mon Feb 23 2015 - 11:43:53 EST

In reply to: Justin Piszcz: "Re: 3.19: ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout"
Next in thread: Justin Piszcz: "RE: 3.19: ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout"

>-----Original Message-----
>From: linux-kernel-owner@xxxxxxxxxxxxxxx [mailto:linux-kernel-owner@xxxxxxxxxxxxxxx] On Behalf Of Justin Piszcz
>Sent: Sunday, February 22, 2015 4:01 AM
>To: linux-kernel@xxxxxxxxxxxxxxx
>Subject: 3.19: ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout
>
>Hello,
>
>Kernel: 3.19.0
>Issue: When using robocopy to copy files (from Windows 8/8.1) to
>Linux/samba, the 10GbE NIC resets - dmesg [1] below. To get it back working
>again, I have to down/up the interface. Jumbo frames are being used (mtu of
>9014) on each side. The lspci output is listed below. Are there any other
>recommended workarounds for this issue? LRO is already off for me, as shown
>below. When using Linux<->Linux with rsync or NFS, there are no errors with
>10GbE. When using Samba<->Windows 8 over 10GbE, this issue occurs
>persistently as shown below when a copy is running.
>
># ethtool -k eth4|grep large
>large-receive-offload: off [fixed]

The issue is a Tx timeout, so LRO (a receive-side offload) is unlikely to have an effect. Is the interface that hangs (eth4) mostly receiving or transmitting? Posting the stats (ethtool -S eth4) would help here.
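
To make the receive-versus-transmit question concrete, here is a minimal sketch that reads saved `ethtool -S eth4` output and reports what share of the byte count was transmitted. The function names (`parse_ethtool_stats`, `tx_share`) are hypothetical helpers, the sample values are made up, and the counter names (`rx_bytes`, `tx_bytes`, `tx_timeout_count`) are assumed to match what the driver reports on your system.

```python
import re


def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Counter lines look like "     tx_packets: 12345". The header line
        # ("NIC statistics:") has no number after the colon and is skipped.
        m = re.match(r"^\s*(\S+):\s*(-?\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats


def tx_share(stats):
    """Fraction of all bytes that were transmitted, or None if no traffic."""
    tx = stats.get("tx_bytes", 0)
    rx = stats.get("rx_bytes", 0)
    total = tx + rx
    return tx / total if total else None


# Made-up sample in the shape of `ethtool -S eth4` output (not real data).
SAMPLE = """\
NIC statistics:
     rx_packets: 1000
     tx_packets: 3000
     rx_bytes: 250000
     tx_bytes: 750000
     tx_timeout_count: 2
"""

if __name__ == "__main__":
    stats = parse_ethtool_stats(SAMPLE)
    print("tx share of bytes:", tx_share(stats))
    print("tx_timeout_count:", stats.get("tx_timeout_count"))
```

Taking one capture before a copy and one after, then comparing the two dicts, shows which counters moved during the hang rather than since boot.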

>There is/was a similar issue as reported here:
>https://communities.intel.com/message/207408
>
> [1] dmesg
>
> [538576.098186] ixgbe 0000:01:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
> [541013.223961] ------------[ cut here ]------------
> [541013.223970] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x227/0x230()
> [541013.223971] NETDEV WATCHDOG: eth4 (ixgbe): transmit queue 0 timed out
> [541013.223972] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.19.0 #2
> [541013.223973] Hardware name: Supermicro X9SRL-F/X9SRL-F, BIOS 3.0a 12/05/2013
> [541013.223974] ffffffff81d3a6ae ffff88107fc03da8 ffffffff819d07d7 ffffffff81e34d98
> [541013.223976] ffff88107fc03df8 ffff88107fc03de8 ffffffff810dbdab 0000000000000000
> [541013.223977] 0000000000000000 ffff881036304000 0000000000000000 0000000000000010
> [541013.223979] Call Trace:
> [541013.223979] <IRQ> [<ffffffff819d07d7>] dump_stack+0x45/0x57
> [541013.223985] [<ffffffff810dbdab>] warn_slowpath_common+0x7b/0xc0
> [541013.223987] [<ffffffff810dbe61>] warn_slowpath_fmt+0x41/0x50
> [541013.223990] [<ffffffff810eec4c>] ? __queue_work+0xfc/0x290
> [541013.223996] [<ffffffff818ef0a7>] dev_watchdog+0x227/0x230
> [541013.223997] [<ffffffff818eee80>] ? qdisc_rcu_free+0x40/0x40
> [541013.223998] [<ffffffff818eee80>] ? qdisc_rcu_free+0x40/0x40
> [541013.224001] [<ffffffff811251f7>] call_timer_fn.isra.29+0x17/0x80
> [541013.224002] [<ffffffff81125429>] run_timer_softirq+0x1c9/0x280
> [541013.224004] [<ffffffff810dec7f>] __do_softirq+0xff/0x200
> [541013.224005] [<ffffffff810deea6>] irq_exit+0x76/0xa0
> [541013.224007] [<ffffffff8106ac11>] smp_apic_timer_interrupt+0x41/0x50
> [541013.224009] [<ffffffff819da6aa>] apic_timer_interrupt+0x6a/0x70
> [541013.224009] <EOI> [<ffffffff8184e8f8>] ? cpuidle_enter_state+0x48/0xc0
> [541013.224013] [<ffffffff8184e8ed>] ? cpuidle_enter_state+0x3d/0xc0
> [541013.224014] [<ffffffff8184ea42>] cpuidle_enter+0x12/0x20
> [541013.224017] [<ffffffff8110f222>] cpu_startup_entry+0x272/0x2f0
> [541013.224018] [<ffffffff819cdd5d>] rest_init+0x6d/0x70
> [541013.224021] [<ffffffff81ef0dbb>] start_kernel+0x353/0x360
> [541013.224022] [<ffffffff81ef0495>] x86_64_start_reservations+0x2a/0x2c
> [541013.224023] [<ffffffff81ef055f>] x86_64_start_kernel+0xc8/0xcc
> [541013.224024] ---[ end trace 59877113cf8b7358 ]---
> [541013.224026] ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout
> [541013.224036] ixgbe 0000:01:00.0 eth4: Reset adapter
> [541020.099402] ixgbe 0000:01:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
>
> ( .. it continues, but without the trace later .. )
>
> [567457.771728] ixgbe 0000:01:00.0 eth4: NIC Link is Down
> [567458.140112] ixgbe 0000:01:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
> [567561.611941] ixgbe 0000:01:00.0 eth4: NIC Link is Down
> [567568.188422] ixgbe 0000:01:00.0 eth4: NIC Link is Up 10 Gbps, Flow Control: RX/TX
> [570130.483823] ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout
> [570130.483924] ixgbe 0000:01:00.0 eth4: Reset adapter

The reset is a side effect of the Tx hang - the driver is trying to recover from the hang by resetting the interface.
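
To see how often the driver ends up in this recovery path, the log can be scanned for the reset message. The following is a minimal sketch, assuming dmesg lines in the format quoted above; `tx_timeout_events` and `gaps` are hypothetical helper names, and the sample log reuses four lines from the report.

```python
import re

# Matches lines such as:
# "[541013.224026] ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout"
TIMEOUT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+ixgbe\s+\S+\s+(\S+): initiating reset due to tx timeout"
)


def tx_timeout_events(log_text):
    """Return [(seconds_since_boot, interface), ...], one per tx-timeout reset."""
    return [(float(m.group(1)), m.group(2)) for m in TIMEOUT_RE.finditer(log_text)]


def gaps(events):
    """Seconds elapsed between consecutive events."""
    times = [t for t, _ in events]
    return [round(b - a, 6) for a, b in zip(times, times[1:])]


# Lines copied from the dmesg output in the report.
LOG = """\
[541013.224026] ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout
[541013.224036] ixgbe 0000:01:00.0 eth4: Reset adapter
[570130.483823] ixgbe 0000:01:00.0 eth4: initiating reset due to tx timeout
[570130.483924] ixgbe 0000:01:00.0 eth4: Reset adapter
"""

if __name__ == "__main__":
    events = tx_timeout_events(LOG)
    for when, iface in events:
        print(f"{iface}: tx timeout reset at {when:.6f}s since boot")
    print("gaps between resets (s):", gaps(events))
```

For the two resets quoted in the report the gap is roughly 29117 seconds, or about eight hours, so lining these timestamps up against the times of the copy jobs may help narrow down the trigger.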

If you could open a ticket at e1000.sf.net with details about your setup and how you configure the interfaces, that would help us get a better idea of the issue. You can also upload the stats, kernel config and any other logs that may be relevant.

Thanks,
Emil

