Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN

From: Neftin, Sasha
Date: Sun Nov 13 2016 - 03:35:32 EST


On 11/11/2016 12:35 AM, Baicar, Tyler wrote:
> Hello Sasha,
>
> On 11/9/2016 11:19 PM, Neftin, Sasha wrote:
>> On 11/9/2016 11:41 PM, Tyler Baicar wrote:
>>> Move IRQ free code so that it will happen regardless of the
>>> __E1000_DOWN bit. Currently the e1000e driver only releases its IRQ
>>> if the __E1000_DOWN bit is cleared. This is not sufficient because
>>> it is possible for __E1000_DOWN to be set without releasing the IRQ.
>>> In such a situation, we will hit a kernel bug later in e1000_remove
>>> because the IRQ still has action since it was never freed. A
>>> secondary bus reset can cause this case to happen.
>>>
>>> Signed-off-by: Tyler Baicar <tbaicar@xxxxxxxxxxxxxx>
>>> ---
>>> drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
>>> b/drivers/net/ethernet/intel/e1000e/netdev.c
>>> index 7017281..36cfcb0 100644
>>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>>> @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev)
>>> if (!test_bit(__E1000_DOWN, &adapter->state)) {
>>> e1000e_down(adapter, true);
>>> - e1000_free_irq(adapter);
>>> /* Link status message must follow this format */
>>> pr_info("%s NIC Link is Down\n", adapter->netdev->name);
>>> }
>>> + e1000_free_irq(adapter);
>>> +
>>> napi_disable(&adapter->napi);
>>> e1000e_free_tx_resources(adapter->tx_ring);
>>>
>> I would like not recommend insert this change. This change related
>> driver state machine, we afraid from lot of synchronization problem and
>> issues.
>> We need keep e1000_free_irq in loop and check for 'test_bit' ready.
>
> What do you mean here? There is no loop. If __E1000_DOWN is set then we
> will never free the IRQ.
>
>> Another point, does before execute secondary bus reset your SW back up
>> pcie configuration space as properly?
>
> After a secondary bus reset, the link needs to recover and go back to a
> working state after 1 second.
>
> From the callstack, the issue is happening while removing the endpoint
> from the system, before applying the secondary bus reset.
>
> The order of events is
> 1. remove the drivers
> 2. cause a secondary bus reset
> 3. wait 1 second
Actually, this is too much, usually link up in less than 100ms.You can
check Data Link Layer indication.
> 4. recover the link
>
> callstack:
> free_msi_irqs+0x6c/0x1a8
> pci_disable_msi+0xb0/0x148
> e1000e_reset_interrupt_capability+0x60/0x78
> e1000_remove+0xc8/0x180
> pci_device_remove+0x48/0x118
> __device_release_driver+0x80/0x108
> device_release_driver+0x2c/0x40
> pci_stop_bus_device+0xa0/0xb0
> pci_stop_bus_device+0x3c/0xb0
> pci_stop_root_bus+0x54/0x80
> acpi_pci_root_remove+0x28/0x64
> acpi_bus_trim+0x6c/0xa4
> acpi_device_hotplug+0x19c/0x3f4
> acpi_hotplug_work_fn+0x28/0x3c
> process_one_work+0x150/0x460
> worker_thread+0x50/0x4b8
> kthread+0xd4/0xe8
> ret_from_fork+0x10/0x50
>
> Thanks,
> Tyler
>
Hello Tyler,
Okay, we need consult more about this suggestion.
May I ask what is setup you run? Is there NIC or on board LAN? I would
like try reproduce this issue in our lab's too.
Also, is same issue observed with same scenario and others NIC's too?
Sasha