Re: [PATCH 1/1] genirq/msi: Dynamic remove/add storage adapter hits EEH

From: Thomas Gleixner
Date: Thu Mar 20 2025 - 04:48:37 EST


On Thu, Mar 20 2025 at 09:23, Thomas Gleixner wrote:
> On Wed, Mar 19 2025 at 21:58, Wen Xiong wrote:
>> We don't see the issue without the dynamic remove/add operation.
>> There is a small window in which the irqbalance daemon kicks in during
>> device reset, so it took over 6 hours to recreate the issue when
>> running the remove/add loop.
>
> Sure. You need a loop to hit the window. And it does not matter whether
> it's the probe or the remove which triggers it. Fact is that the reset
> wipes out the config space, which means that any read from the config
> space between reset and restore will return garbage. That problem is not
> restricted to the interrupt code. It's a general problem.

After looking at the code again, it's a problem in the remove()
function:

   __ipr_remove()
      ipr_initiate_ioa_bringdown()
        // resets device
        restore_config_space()
      ....
      ipr_free_all_resources()
        free_irqs()
So yes, it's not probe(). But the question is pretty much the same.

Why is a reset issued while the driver is fully operational and
resources are still in use?

Don't even think about telling me that this is a problem of the MSI
interrupt rework. It is not. It's been broken forever.

You _cannot_ pull the rug out from under a fully operational driver and
expect that all involved parts will just magically handle this gracefully.

What about tearing down resources first and then issuing the reset?
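
To illustrate the ordering argument, here is a minimal userspace sketch.
It is not ipr code and uses no kernel API: struct toy_dev and every
toy_*() helper are hypothetical names made up for this example. It only
models the point above: as long as the interrupt is still set up,
something like an affinity change may touch config space in the window
between reset and restore. Whether that is exactly what the hardware
does in the reported case is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_CONFIG_VALID 0x00C0FFEEu
#define TOY_CONFIG_WIPED 0xFFFFFFFFu

/* Hypothetical toy device, not the ipr driver. */
struct toy_dev {
	uint32_t config;        /* stands in for the PCI config space */
	bool     irq_active;    /* interrupt is still requested */
	unsigned garbage_reads; /* config reads that hit the reset window */
};

static void toy_init(struct toy_dev *d)
{
	d->config = TOY_CONFIG_VALID;
	d->irq_active = true;
	d->garbage_reads = 0;
}

/* Models an affinity change: it only touches the device while the
 * interrupt is still set up. */
static void toy_affinity_change(struct toy_dev *d)
{
	if (!d->irq_active)
		return;
	if (d->config != TOY_CONFIG_VALID)
		d->garbage_reads++;
}

static void toy_free_irqs(struct toy_dev *d)
{
	d->irq_active = false;
}

/* Reset wipes the config space and restores it afterwards. The
 * affinity change is placed inside the window on purpose, to model
 * the worst case deterministically. */
static void toy_reset(struct toy_dev *d)
{
	uint32_t saved = d->config;

	assert(saved == TOY_CONFIG_VALID);
	d->config = TOY_CONFIG_WIPED;
	toy_affinity_change(d);
	d->config = saved;
}

/* Ordering from the call chain above: reset first, free later. */
static unsigned toy_remove_reset_first(struct toy_dev *d)
{
	toy_reset(d);
	toy_free_irqs(d);
	return d->garbage_reads;
}

/* Suggested ordering: tear down first, then reset. */
static unsigned toy_remove_teardown_first(struct toy_dev *d)
{
	toy_free_irqs(d);
	toy_reset(d);
	return d->garbage_reads;
}
```

In this model the reset-first variant records one garbage read and the
teardown-first variant records none, because nothing is left that could
touch the device during the window.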

Thanks,

tglx