Re: [PATCH v3 10/10] irqchip/riscv-imsic: Use IRQCHIP_MOVE_DEFERRED flag for PCI devices

From: Thomas Gleixner
Date: Tue Feb 04 2025 - 03:56:11 EST


On Tue, Feb 04 2025 at 13:24, Anup Patel wrote:
> Devices (such as PCI) which have non-atomic MSI update should
> migrate irq in the interrupt-context so use IRQCHIP_MOVE_DEFERRED
> flag for corresponding irqchips.
>
> The use of IRQCHIP_MOVE_DEFERRED further simplifies IMSIC vector
> movement as follows:
>
> 1) No need to handle the intermediate state seen by devices with
> non-atomic MSI update because imsic_irq_set_affinity() is called
> in the interrupt-context with interrupt masked.
> 2) No need to check temporary vector when completing vector movement
> on the old CPU in __imsic_local_sync().
> 3) No need to call imsic_local_sync_all() from imsic_handle_irq()

I have no idea how you came to that delusion.

IRQCHIP_MOVE_DEFERRED is part of the mechanism to handle this insanity
correctly. It does not prevent the device from observing and actually
using the intermediate state. All it does is to ensure that the kernel
can observe this case and act on it.

The fact that the kernel executes the interrupt handler on the original
target CPU does not prevent the device from firing another interrupt.
PCI/MSI interrupts are strictly edge. i.e. fire and forget.

IRQCHIP_MOVE_DEFERRED solely ensures that the racy affinity update in
the PCI device happens in the context of the original target CPU, which
is required to handle all possible cases correctly.

Let's assume the interrupt is affine to CPU0, vector A and a move to
CPU1, vector B is pending. So we have three possible scenarios:

CPU0 Device
interrupt
1) Raises interrupt on CPU0, vector A
...

write_msg()
write_address(CPU1)
2) Raises interrupt on CPU1, vector A
write_data(vector B)
3) Raises interrupt on CPU1, vector B

#1 is handled correctly because the interrupt is retriggered on CPU0,
vector A, which still has the interrupt associated (it's cleaned up
_after_ the first interrupt arrives on CPU1, vector B).

#2 cannot be handled because CPU1, vector A is either not in use or
associated to a completely unrelated interrupt, which means if that
happens the interrupt is lost and the device might become stale.

#3 is handled correctly for obvious reasons.

The only way to handle #2 properly is to do the intermediate update to
CPU0, vector B and checking for a pending interrupt on that. The
important role IRQCHIP_MOVE_DEFERRED plays here is that it guarantees
that the update happens on CPU0 (original target). Which in turn is
required to observe that CPU0, vector B has been raised.

The same could be achieved by executing that intermediate transition on
CPU0 with interrupts disabled by affining the calling context (thread)
to CPU0 or by issuing an IPI on CPU0 and doing it in that context. I
looked into that, but that has it's own pile of issues. So at the end
moving it in the context of the interrupt on the original CPU/vector
turned out to be the simplest way to achieve it.

Thanks,

tglx