Re: [PATCH] igb: Fix igb_down hung on surprise removal
From: Stefan Schaeckeler
Date: Thu Jun 06 2024 - 12:06:50 EST
Hello Ying,
On 6/6/24 01:03, Ying Hsu wrote:
> On the CalDigit Thunderbolt Station 3 Plus, we've encountered an issue
> when the USB downstream display connection state changes. The
> problematic sequence observed is:
> ```
> igb_io_error_detected
> igb_down
> igb_io_error_detected
> igb_down
> ```
>
> The second igb_down call blocks at napi_synchronize.
From the backtrace in your commit message, I get the impression that you receive a hotplug event for the removal of the Ethernet device. From the commit message I also get the impression that you receive an AER, which is handled in igb_io_error_detected()/igb_io_resume(). IMHO the problem lies in the interaction of the two.
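To illustrate why a second igb_down() would block at napi_synchronize(), here is a minimal userspace sketch. It assumes my reading of the NAPI code is right: napi_disable() leaves NAPI_STATE_SCHED set until napi_enable() runs, and napi_synchronize() waits for that bit to clear. All model_* names are made up for illustration and do not exist in the kernel; this is not the driver code.

```c
#include <stdbool.h>

/* Userspace model only; none of these names exist in the kernel. */
struct model_napi {
	bool sched;	/* stands in for NAPI_STATE_SCHED */
};

/* Models napi_disable(): the bit stays set until napi_enable(). */
static void model_napi_disable(struct model_napi *n)
{
	n->sched = true;
}

/* Models napi_enable(): clears the bit so polling can be scheduled again. */
static void model_napi_enable(struct model_napi *n)
{
	n->sched = false;
}

/*
 * Models napi_synchronize(): returns true when the bit is set and nobody
 * is going to clear it, i.e. the caller would wait indefinitely.
 */
static bool model_napi_synchronize_blocks(const struct model_napi *n)
{
	return n->sched;
}

/* Models the NAPI part of igb_down(); returns false if it would hang. */
static bool model_igb_down(struct model_napi *n)
{
	if (model_napi_synchronize_blocks(n))
		return false;
	model_napi_disable(n);
	return true;
}
```

In this model the first model_igb_down() succeeds, a second one without an intervening model_napi_enable() reports that it would hang, and the call succeeds again once NAPI has been re-enabled. That matches the sequence you describe, where nothing brings the interface back up between the two igb_io_error_detected() calls.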
> Simply avoiding redundant igb_down calls makes the Ethernet of the thunderbolt dock unusable.
I'm not sure the current code is entirely correct even in your use case. What happens when you get an AER on the Ethernet device without unplugging it at the same time?
Can you try to AER-inject a completion timeout into your Ethernet device, similar to how I showed it in my previous message? Just replace the BDF 09:00.0 with the BDF of your Ethernet device. I expect a kernel crash like the one we see on our embedded system.
> If Intel can identify when an Ethernet device is within a Thunderbolt
> tunnel, the patch can be more specific.
Stefan