Re: [RFC PATCH 01/15] irqchip/riscv-imsic: Use hierarchy to reach irq_set_affinity

From: Thomas Gleixner
Date: Wed Dec 04 2024 - 08:16:45 EST


On Wed, Dec 04 2024 at 09:13, Anup Patel wrote:
> On Wed, Dec 4, 2024 at 4:29 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> As I was looking at something else MSI related I had a look at
>> imsic_irq_set_affinity() again.
>>
>> It's actually required to have the message write in that function and
>> not afterwards as you invoke imsic_vector_move() from that function.
>>
>> That's obviously not true for the remap case as that will not change the
>> message address/data pair because the remap table entry is immutable -
>> at least I assume so for my mental sanity sake :)
>>
>> But that brings me to a related question. How is this supposed to work
>> with non-atomic message updates? PCI/MSI does not necessarily provide
>> masking, and the write of the address/data pair is done in bits and
>> pieces. So you can end up with an intermediate state seen by the device
>> which ends up somewhere in interrupt nirvana space.
>>
>> See the dance in msi_set_affinity() and commit 6f1a4891a592
>> ("x86/apic/msi: Plug non-maskable MSI affinity race") for further
>> explanation.
>>
>> The way how the IMSIC driver works seems to be pretty much the same as
>> the x86 APIC mess:
>>
>> @address is the physical address of the per CPU MSI target
>> address and @data is the vector ID on that CPU.
>>
>> So the non-atomic update in case of non-maskable MSI suffers from the
>> same problem. It works most of the time, but if it doesn't you might
>> stare at the occasionally lost interrupt and the stale device in
>> disbelief for quite a while :)
>
> Yes, we have the same challenges as x86 APIC when changing
> MSI affinity.
>>
>> I might be missing something which magically prevent that though :)
>>
> Your understanding is correct. In fact, the IMSIC msi_set_affinity()
> handling is inspired from x86 APIC approach due to similarity in
> the overall MSI controller.
>
> The high-level idea of imsic_irq_set_affinity() is as follows:
>
> 1) Allocate new_vector (new CPU IMSIC address + new ID on that CPU)
>
> 2) Update the MSI address and data programmed in the device
> based on new_vector (see imsic_msi_update_msg())
>
> 3) At this point the device points to the new_vector but old_vector
> (old CPU IMSIC address + old ID on that CPU) is still enabled and
> we might have received MSI on old_vector while we were busy
> setting up a new_vector for the device. To address this, we call
> imsic_vector_move().
>
> 4) The imsic_vector_move() marks the old_vector as being
> moved and schedules a lazy timer on the old CPU.
>
> 5) The lazy timer expires on the old CPU and results in
> __imsic_local_sync() being called on the old CPU.
>
> 6) If there was a pending MSI on the old vector then the
> __imsic_local_sync() function injects an MSI to the
> new_vector using an MMIO write.
>
> It is very unlikely that an MSI from device will be dropped
> (unless I am missing something) but the unsolved issue
> is that handling of in-flight MSI received on the old_vector
> during the MSI re-programming is delayed which may have
> side effects on the device driver side.

Interrupt delivery can be delayed for tons of other reasons.

But yes, you are missing something which is worse than a jiffie delay:

CPU Device
msi_update_msg()
compose()
write()
write(msg->address_lo); [1]
write(msg->address_hi); [2]
Raises interrupt [3]
write(msg->data); [4]

[2] can be ignored as it should not change (otherwise 32bit only devices
would not work).

Lets assume that the original message was:

addr_lo = 0x1000, data = 0x10 (CPU1, vector 0x10)

The new message is

addr_lo = 0x2000, data = 0x20 (CPU2, vector 0x20)

After [2] the device sees:

addr_lo = 0x2000, data = 0x10 (CPU2, vector 0x10)

The interrupt raised in [3] will end up on CPU2 at vector 0x10 which
might be not in use or used by some other device. In any case the
interrupt is not reaching the real device handler and you can't see that
interrupt as pending in CPU1.

That's why x86 in case of non-remapped interrupts has this for the two
situations:

1) old->data == new->data

write_msg(new);

The next interrupt either arrives on the old CPU or on the new CPU
depending on timing. There is no intermediate state because the
vector (data) is the same on both CPUs.

2) old->data != new->data

tmp_msg.addr = old_msg.addr
tmp_msg.data = new_msg.data

write_msg(tmp_msg);

So after that write the device might raise the interrupt on CPU1
and vector 0x20.

The next step is

write_msg(new);

which changes the destination CPU to CPU2.

So depending what the device has observed the interrupt might end
up on

CPU1 vector 0x10 (old)
CPU1 vector 0x20 (tmp)
CPU2 vector 0x20 (new)

CPU1 vector 0x20 (tmp) is then checked in the pending register and
the interrupt is retriggered if pending.

That requires to move the interrupt from actual interrupt context on the
old target CPU. It allows to evaluate the old target CPUs pending
register with local interrupts disabled, which obviously does not work
remote.

I don't see a way how that can work remote with the IMSIC either even if
you can easily access the pending state of the remote CPU:

CPU0 CPU1 Device
set_affinity()
write_msg(tmp)
write(addr); // CPU1
write(data); // vector 0x20
raise IRQ
handle vector 0x20
(spurious or other device)

check_pending(CPU1, 0x20) == false -> Interrupt is lost

Remapped interrupts do not have that issue because the table update is
atomic vs. a concurrently raised interrupt, so that it either ends up on
the old or on the new destination. The device message does not change as
it always targets the table entry, which is immutable after setup.

That's why x86 has this special msi affinity setter for the non remapped
case. For the remapped case it's just using the hierarchy default which
ends up at the remap domain.

So the hierarchies look like this:

1) non-remap

PCI/MSI device domain
irq_chip.irq_set_affinity = msi_set_affinity;
VECTOR domain

2) remap

PCI/MSI device domain
irq_chip.irq_set_affinity = msi_domain_set_affinity;
REMAP domain
irq_chip.irq_set_affinity = remap_set_affinity;
VECTOR domain

The special case of msi_set_affinity() is not involved in the remap case
and solely utilized for the non-remap horrors. The remap case does not
require the interrupt to be moved in interrupt context either because
there is no intermediate step and pending register check on the old CPU
involved.

The vector cleanup of the old CPU always happens when the interrupt
arrives on the new target for the first time independent of remapping
mode. That's required for both cases to take care of the case where the
interrupt is raised on the old vector (cpu1, 0x10) before the write
happens and is pending in CPU1. So after desc::lock is released and
interrupts are reenabled the interrupt is handled on CPU1. The next one
is guaranteed to arrive on CPU2 and that triggers the clean up of the
CPU1 vector, which releases it for reuse in the matrix allocator.

I think you should model it in a similar way instead of trying to
artificially reuse imsic_irq_set_affinity() for the remap case,
especially when you decide to close the non-maskable MSI hole described
above, which I recommend to do :)

You still can utilize msi-lib and just have your private implementation
of init_dev_msi_info() as a wrapper around the library similar to
mbi_init_dev_msi_info() and its_init_dev_msi_info(). That wrapper would
just handle the non-remap case to set info::chip:irq_set_affinity to the
magic non-remap function.

Hope that helps.

> I believe in the future RISC-V AIA v2.0 (whenever that
> happens) will address the gaps in AIA v1.0 (like this one).

If that is strictly translation table based, yes.

Thanks,

tglx