Re: [PATCH] igb: Fix igb_down hung on surprise removal

From: Stefan Schaeckeler
Date: Wed Jun 05 2024 - 20:05:27 EST


Hi all,

This commit introduced a regression. What does not work with this commit is an AER without a surprise removal event.


On 5/18/23 00:26, Ying Hsu wrote:
> In a setup where a Thunderbolt hub connects to Ethernet and a display
> through USB Type-C, users may experience a hung task timeout when they
> remove the cable between the PC and the Thunderbolt hub.
> This is because the igb_down function is called multiple times when
> the Thunderbolt hub is unplugged. For example, the igb_io_error_detected
> triggers the first call, and the igb_remove triggers the second call.
> The second call to igb_down will block at napi_synchronize.
> Here's the call trace:
> __schedule+0x3b0/0xddb
> ? __mod_timer+0x164/0x5d3
> schedule+0x44/0xa8
> schedule_timeout+0xb2/0x2a4
> ? run_local_timers+0x4e/0x4e
> msleep+0x31/0x38
> igb_down+0x12c/0x22a [igb 6615058754948bfde0bf01429257eb59f13030d4]
> __igb_close+0x6f/0x9c [igb 6615058754948bfde0bf01429257eb59f13030d4]
> igb_close+0x23/0x2b [igb 6615058754948bfde0bf01429257eb59f13030d4]
> __dev_close_many+0x95/0xec
> dev_close_many+0x6e/0x103
> unregister_netdevice_many+0x105/0x5b1
> unregister_netdevice_queue+0xc2/0x10d
> unregister_netdev+0x1c/0x23
> igb_remove+0xa7/0x11c [igb 6615058754948bfde0bf01429257eb59f13030d4]
> pci_device_remove+0x3f/0x9c
> device_release_driver_internal+0xfe/0x1b4
> pci_stop_bus_device+0x5b/0x7f
> pci_stop_bus_device+0x30/0x7f
> pci_stop_bus_device+0x30/0x7f
> pci_stop_and_remove_bus_device+0x12/0x19
> pciehp_unconfigure_device+0x76/0xe9
> pciehp_disable_slot+0x6e/0x131
> pciehp_handle_presence_or_link_change+0x7a/0x3f7
> pciehp_ist+0xbe/0x194
> irq_thread_fn+0x22/0x4d
> ? irq_thread+0x1fd/0x1fd
> irq_thread+0x17b/0x1fd
> ? irq_forced_thread_fn+0x5f/0x5f
> kthread+0x142/0x153
> ? __irq_get_irqchip_state+0x46/0x46
> ? kthread_associate_blkcg+0x71/0x71
> ret_from_fork+0x1f/0x30
>
> In this case, igb_io_error_detected detaches the network interface
> and requests a PCIE slot reset, however, the PCIE reset callback is
> not being invoked and thus the Ethernet connection breaks down.
> As the PCIE error in this case is a non-fatal one, requesting a
> slot reset can be avoided.
> This patch fixes the task hung issue and preserves Ethernet
> connection by ignoring non-fatal PCIE errors.
>
> Signed-off-by: Ying Hsu <yinghsu@xxxxxxxxxxxx>
> ---
> This commit has been tested on a HP Elite Dragonfly Chromebook and
> a Caldigit TS3+ Thunderbolt hub. The Ethernet driver for the hub
> is igb. Non-fatal PCIE errors happen when users hot-plug the cables
> connected to the chromebook or to the external display.
>
> drivers/net/ethernet/intel/igb/igb_main.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index 58872a4c2540..a8b217368ca1 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -9581,6 +9581,11 @@ static pci_ers_result_t igb_io_error_detected(struct pci_dev *pdev,
> struct net_device *netdev = pci_get_drvdata(pdev);
> struct igb_adapter *adapter = netdev_priv(netdev);
>
> + if (state == pci_channel_io_normal) {
> + dev_warn(&pdev->dev, "Non-correctable non-fatal error reported.\n");
> + return PCI_ERS_RESULT_CAN_RECOVER;
> + }
> +
> netif_device_detach(netdev);
>
> if (state == pci_channel_io_perm_failure)

We are currently stuck with the 5.4 kernel - our embedded system can't easily boot arbitrary kernel versions. The igb driver code has not changed and I'm quite positive the issue still persist in the latest upstream kernel.

This issue is reproducible with aer_inject. 09:00.0 is our i210 which is directly connected to the cpu root port 00:01.1:


- - - snip - - -
[node0_RP0_CPU0:~]$cat aer.log
AER
PCI_ID 09:00.0
UNCOR_STATUS COMP_TIME
HEADER_LOG 0 1 2 3
- - - snip - - -

- - - snip - - -
[node0_RP0_CPU0:~]$aer-inject aer.log
[369145.900845] pcieport 0000:00:01.1: aer_inject: Injecting errors 00000000/00004000 into device 0000:09:00.0
[369145.912726] pcieport 0000:00:01.1: AER: Uncorrected (Non-Fatal) error received: 0000:09:00.0
[369145.923124] igb 0000:09:00.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[369145.936791] igb 0000:09:00.0: AER: device [8086:1537] error status/mask=00004000/00000000
[369145.947068] igb 0000:09:00.0: AER: [14] CmpltTO
[369145.954602] igb 0000:09:00.0: Non-correctable non-fatal error reported.
[369145.984564] ------------[ cut here ]------------
[369145.990285] kernel BUG at include/linux/netdevice.h:529!
[369145.996860] invalid opcode: 0000 [#1] SMP PTI
[369146.002267] CPU: 3 PID: 142 Comm: irq/26-aerdrv Kdump: loaded Tainted: G O 5.4.251-yocto-standard #1
[369146.015073] Hardware name: Cisco System Inc. SF-D8/Type2 - Board Product Name1, BIOS 1-29-g46d9e72a-s 05/03/2019
[369146.027570] RIP: 0010:igb_up+0x51/0x160 [igb]
[369146.032974] Code: d2 eb 16 f0 80 60 60 fe f0 80 60 60 f7 48 83 c2 01 39 93 14 02 00 00 76 13 48 8b 84 d3 08 0f 00 00 48 8b 48 60 83 e1 01 75 d9 <0f> 0b f6 83 11 02 00 00 20 0f 85 c0 00 00 00 48 8b bb 08 0f 00 00
[369146.055938] RSP: 0018:ffffb29b0045bcf0 EFLAGS: 00010246
[369146.062399] RAX: ffff8d938d99e400 RBX: ffff8d9398a08740 RCX: 0000000000000000
[369146.071186] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8d938d99e2c0
[369146.079974] RBP: ffffb29b0045bd00 R08: 00000000000002ec R09: ffffb29b002fa000
[369146.088761] R10: 0000000000006f80 R11: 000000000090ef2c R12: ffff8d9398a08000
[369146.097547] R13: ffff8d9398a08740 R14: ffff8d939ae92c00 R15: ffff8d939ae92c28
[369146.106335] FS: 0000000000000000(0000) GS:ffff8d939fcc0000(0000) knlGS:0000000000000000
[369146.116286] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[369146.123381] CR2: 000056318eb418e8 CR3: 000000084e744002 CR4: 00000000003606e0
[369146.132170] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[369146.140960] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[369146.149747] Call Trace:
[369146.152826] ? show_regs.cold+0x1a/0x1f
[369146.157592] ? __die+0x90/0xd9
[369146.161408] ? die+0x30/0x50
[369146.165002] ? do_trap+0x85/0xf0
[369146.169029] ? do_error_trap+0x7c/0xb0
[369146.173695] ? igb_up+0x51/0x160 [igb]
[369146.178355] ? do_invalid_op+0x3c/0x50
[369146.183019] ? igb_up+0x51/0x160 [igb]
[369146.187681] ? invalid_op+0x7c/0x90
[369146.192027] ? igb_up+0x51/0x160 [igb]
[369146.196691] ? hrtimer_try_to_cancel+0x2c/0x110
[369146.202305] ? schedule_hrtimeout_range_clock+0xa0/0x110
[369146.208871] ? hrtimer_init_sleeper+0x90/0x90
[369146.214277] ? igb_rx_fifo_flush_82575+0x32/0x270 [igb]
[369146.220738] ? igb_configure+0x417/0x650 [igb]
[369146.226247] ? igb_up+0x51/0x160 [igb]
[369146.230913] igb_io_resume+0x31/0x50 [igb]
[369146.235998] report_resume+0x5c/0x80
[369146.240449] ? pcie_portdrv_probe+0x70/0x70
[369146.245631] pci_walk_bus+0x75/0x90
[369146.249966] pcie_do_recovery+0x163/0x280
[369146.254947] aer_process_err_devices+0xa2/0xd1
[369146.260455] aer_isr.cold+0x52/0xa1
[369146.264799] ? __schedule+0x2bf/0x680
[369146.269349] ? irq_finalize_oneshot+0xf0/0xf0
[369146.274754] irq_thread_fn+0x28/0x50
[369146.279205] irq_thread+0xf8/0x180
[369146.283445] ? wake_threads_waitq+0x30/0x30
[369146.288638] kthread+0x104/0x140
[369146.292666] ? irq_thread_check_affinity+0x80/0x80
[369146.298597] ? __kthread_cancel_work+0x40/0x40
[369146.304106] ret_from_fork+0x35/0x40
- - - snip - - -

BUG_ON() comes from

- - - snip - - -
static inline void napi_enable(struct napi_struct *n)
{
BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
smp_mb__before_atomic();
clear_bit(NAPI_STATE_SCHED, &n->state);
clear_bit(NAPI_STATE_NPSVC, &n->state);
}
- - - snip - - -

The stack-trace shows that the AER handler starts here and goes into igb_up():

- - - snip - - -
static void igb_io_resume(struct pci_dev *pdev)
{
struct net_device *netdev = pci_get_drvdata(pdev);
struct igb_adapter *adapter = netdev_priv(netdev);

if (netif_running(netdev)) {
if (igb_up(adapter)) {
dev_err(&pdev->dev, "igb_up failed after reset\n");
return;
}
}

...
}
- - - snip - - -
Three functions come into the picture:

igb_io_error_detected() { // runs upon AER detection
igb_down(); // before this commit
noop; // with this commit
}

igb_io_resume() { // runs upon AER handling
igb_up()
}

igb_remove() { // runs upon rmmod igb, or surprise down removal (as shown in the commit log)
igb_down();
}

Before this commit, the flow for an AER on 09:00.0 was
igb_down; // from igb_io_error_detect() - this sets the NAPI_STATE_SCHED bit
igb_up; // from igb_io_resume() - that works as it finds a set NAPI_STATE_SCHED bit (and then clears it)

Now with this commit, the flow for an AER on 09:00.0 is
noop; // from igb_io_error_detect() - the NAPI_STATE_SCHED bit is not touched, e.g. it remains cleared
igb_up; // from igb_io_resume() - BUG_ON() is triggered as it finds a cleared NAPI_STATE_SCHED bit


I don't have a means to reproduce this i210 on thunderbolt issue and don't completely understand its flow. Before this commit, the "expected" flow for i210 on thunderbolt was probably

igb_down; // from igb_io_error_detect() - this sets the NAPI_STATE_SCHED bit
igb_up; // from igb_io_resume() - that works as it finds a set NAPI_STATE_SCHED bit and cleans it.
igb_down; // from igb_remove() - that works b/c of the previous igb_up()

The bug report shows that this was not the case. I don't completely understand the reported problem: what happens to and in igb_io_resume() and its call to igb_up()? In the commit log backtrace, I see igb_remove() eventually calling __dev_close_many() which clears the __LINK__STATE_START bit. igb_io_resume() does check this bit via netif_running() and calls only then igb_up(). My guess is igb_io_resume() executes after igb_remove() and skips therefore the execution of igb_up(). As we see from the commit log backtrace, that makes igb_remove() starve now in igb_down()->napi_synchronize().

How to fix that, I don't know.

Stefan