Re: IRQ affinity not working on Xen pci-platform device^W^W^W QEMU split-irqchip I/O APIC.

From: David Woodhouse
Date: Sat Mar 04 2023 - 04:57:57 EST


On Sat, 2023-03-04 at 01:28 +0100, Thomas Gleixner wrote:
> David!
>
> On Fri, Mar 03 2023 at 16:54, David Woodhouse wrote:
> > On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote:
> > > >
> > > > [    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > > > [    0.578149] The affinity mask was 0-3
> > > > [    0.579081] The affinity mask is 0-3 and the handler is on 2
> > > > [    0.580288] The affinity mask is 0 and the handler is on 2
> > >
> > > What happens is that once the interrupt is requested, the affinity
> > > setting is deferred to the first interrupt. See the marvelous dance in
> > > arch/x86/kernel/apic/msi.c::msi_set_affinity().
> > >
> > > If you do the setting before request_irq() then the startup will assign
> > > it to the target mask right away.
> > >
> > > Btw, you are using irq_get_affinity_mask(), which gives you the desired
> > > target mask. irq_get_effective_affinity_mask() gives you the real one.
> > >
> > > Can you verify that the thing moves over after the first interrupt or is
> > > that too late already?
> >
> > It doesn't seem to move. The hack to just return IRQ_NONE if invoked on
> > CPU != 0 was intended to do just that. It's a level-triggered interrupt
> > so when the handler does nothing on the "wrong" CPU, it ought to get
> > invoked again on the *correct* CPU and actually work that time.
>
> So much for the theory. This is virt after all so it does not
> necessarily behave like real hardware.

I think you're right. This looks like a QEMU bug with the "split
irqchip" I/OAPIC.

For reasons I'm unclear about, and which lack a comment in the code,
QEMU still injects I/OAPIC events into the kernel with kvm_set_irq().
(I think it's to do with caching, because QEMU doesn't cache interrupt-
remapping translations anywhere *except* in the KVM IRQ routing table,
so if it just synthesised an MSI message every time it'd have to
retranslate it every time?)

Tracing the behaviour here shows:

• First interrupt happens on CPU2.
• Linux updates the I/OAPIC RTE to point to CPU0, but QEMU doesn't
update the KVM IRQ routing table yet.
• QEMU retriggers the (still-high, level-triggered) IRQ.
• QEMU calls kvm_set_irq(11), delivering it to CPU2 again.
• QEMU *finally* calls ioapic_update_kvm_routes().
• Linux sees the interrupt on CPU2 again.

$ qemu-system-x86_64 -display none -serial mon:stdio \
-accel kvm,xen-version=0x4000a,kernel-irqchip=split \
-kernel ~/git/linux/arch/x86/boot//bzImage \
-append "console=ttyS0,115200 xen_no_vector_callback" \
-smp 4 --trace ioapic\* --trace xenstore\*


...

xenstore_read tx 0 path control/platform-feature-xs_reset_watches
ioapic_set_irq vector: 11 level: 1
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
[ 0.523627] The affinity mask was 0-3 and the handler is on 2
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee02000 8021
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
xenstore_reset_watches
ioapic_set_irq vector: 11 level: 1
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x1c021
[ 0.524569] ioapic_ack_level IRQ 11 moveit = 1
ioapic_eoi_broadcast EOI broadcast for vector 33
ioapic_clear_remote_irr clear remote irr for pin 11 vector 33
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x18021
[ 0.525235] ioapic_finish_move IRQ 11 calls irq_move_masked_irq()
[ 0.526147] irq_do_set_affinity for IRQ 11, 0
[ 0.526732] ioapic_set_affinity for IRQ 11, 0
[ 0.527330] ioapic_setup_msg_from_msi for IRQ11 target 0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x27
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x27 size 0x4 val 0x0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
[ 0.527623] ioapic_set_affinity returns 0
[ 0.527623] ioapic_finish_move IRQ 11 calls unmask_ioapic_irq()
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x8021
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee00000 8021
[ 0.529571] The affinity mask was 0 and the handler is on 2
xenstore_watch path memory/target token FFFFFFFF92847D40
xenstore_watch_event path memory/target token FFFFFFFF92847D40
ioapic_set_irq vector: 11 level: 1
[ 0.530486] ioapic_ack_level IRQ 11 moveit = 0


This is with Linux doing basically nothing when the handler is invoked
on the 'wrong' CPU, and just waiting for it to be right.

Commenting out the kvm_set_irq() calls in ioapic_service() and letting
QEMU synthesise an MSI every time works. Better still, so does this,
making it update the routing table *before* retriggering the IRQ when
the guest updates the RTE:

--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -405,6 +409,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
s->ioredtbl[index] |= ro_bits;
s->irq_eoi[index] = 0;
ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);
+ ioapic_update_kvm_routes(s);
ioapic_service(s);
}
}
@@ -418,7 +423,6 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
break;
}

- ioapic_update_kvm_routes(s);
}

static const MemoryRegionOps ioapic_io_ops = {



Now, I don't quite see why we don't get a *third* interrupt, since
Linux did nothing to clear the level of IRQ 11 and the last trace I see
from QEMU's ioapic_set_irq confirms it's still set. But I've exceeded
my screen time for the day, so I'll have to frown at that part some
more later. I wonder if the EOI is going missing because it's coming
from the wrong CPU? Note no 'EOI broadcast' after the last line in the
log I showed above; it isn't just that I trimmed it there.

I don't think we need to do anything in Linux; if the handler gets
invoked on the wrong CPU it'll basically find no events pending for
that CPU and return having done nothing... and *hopefully* should be
re-invoked on the correct CPU shortly thereafter.
