Re: [PATCH] tg3: guard napi_disable and pci_disable_device calls

From: Pavan Chebbi

Date: Fri May 15 2026 - 12:01:11 EST


On Fri, May 15, 2026 at 4:28 PM Yury Murashka <yurypm@xxxxxxxxxx> wrote:
>
> During PCIe hot-plug events, uncorrectable errors can be reported and
> AER recovery for the tg3 device is initiated by the AER kernel driver.
> The tg3_io_error_detected function is the AER error recovery handler.
>
> From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
> napi_disable and return PCI_ERS_RESULT_NEED_RESET on non-fatal error.
> We expect that during AER recovery tg3_io_slot_reset and tg3_io_resume
> will be called. But AER error recovery can fail. For example, when one
> of PCIe devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER.
> As a result, tg3_io_slot_reset and tg3_io_resume are not called, PCIe
> device is disabled and NAPI is disabled (pci_disable_device and
> napi_disable are called from tg3_io_error_detected). Then we can try to
> disable PCIe link and napi_disable will be called again:

Calling napi_disable() in case of teardown due to error and in
ndo_stop is very common.
So I imagine many drivers will encounter this same situation. I am not
sure how real the NO_AER_DRIVER occurring situation is.
If yes, then we need to fix more drivers?

>
> napi_disable+0x1b/0x1b0
> tg3_napi_disable+0x89/0xa0 [tg3]
> tg3_netif_stop+0x37/0xe3 [tg3]
> tg3_stop+0x30/0x160 [tg3]
> tg3_close+0x2a/0x60 [tg3]
> __dev_close_many+0xad/0x130
> dev_close_many+0xb2/0x190
> unregister_netdevice_many_notify+0x19d/0xa00
> unregister_netdevice_queue+0xf8/0x140
> unregister_netdev+0x1c/0x30
> tg3_remove_one+0xaa/0x150 [tg3]
> pci_device_remove+0x42/0xb0
> device_release_driver_internal+0x19c/0x200
> pci_stop_bus_device+0x85/0xb0
> pci_stop_bus_device+0x2c/0xb0
> pci_stop_bus_device+0x2c/0xb0
> pci_stop_and_remove_bus_device+0x12/0x20
> pciehp_unconfigure_device+0x9f/0x160
> pciehp_disable_slot+0x67/0x100
> pciehp_handle_presence_or_link_change+0x77/0x350
>
> This is not expected by napi_disable and a thread can be locked in
> napi_disable forever. We have pcierr_recovery to cover a similar issue,
> but for fatal errors. We cannot reuse this flag because it is reset in
> tg3_io_resume, but it is not called when AER recovery fails.
>
> Similarly, if an AER error is reported and tg3_io_error_detected calls
> pci_disable_device, a subsequent device removal via tg3_remove_one or
> tg3_shutdown will call pci_disable_device again for the already-disabled
> device.

I believe the same argument is true here also..
P.S: patches containing fixes should mention 'net' and should contain fixes tag

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature