Re: [RFC PATCH v2 3/4] irq: Introduce IRQ_HANDLED_MANY

From: Thomas Gleixner
Date: Mon Dec 02 2024 - 17:53:54 EST


On Mon, Nov 18 2024 at 22:15, Leonardo Bras wrote:
> On Thu, Nov 14, 2024 at 09:50:36AM +0200, Andy Shevchenko wrote:
>> - the IRQ core disables IRQ while handling an IRQ number in question;
>
> Not necessarily:
> When on irqs are force-threaded, only a quick handler is called, returning
> IRQ_WAKE_THREAD, which is supposed to wake up the handler thread.
>
> * @IRQ_WAKE_THREAD: handler requests to wake the handler thread
>
> In this case (which is what I am dealing with), the actual handler will run
> in thread context (which I suppose don't disable IRQ for sched-out
> purposes).

Let's talk about the disable irq concepts here. There are 3 distinct variants:

1) Disable interrupts on the local CPU: local_irq_disable/save().

This only prevents the CPU from handling a pending interrupt,
but does not prevent new interrupts from being marked pending
in the interrupt controller.

2) Disable interrupts in the interrupt controller

disable_irq*() variants handle that. They come with two
flavors: LAZY and UNLAZY

LAZY does not mask the interrupt line right away. It only is masked
when an interrupt arrives while the line is marked "disabled". This
then sets the PENDING bit, which in turn causes a replay of the
interrupt once enable_irq() is bringing the disabled count down to
zero. LAZY has two purposes:

A) Prevent losing edge type interrupts

B) Optimization to avoid the maybe costly hardware access

UNLAZY masks the interrupt right away.

3) Disable interrupts at the device level

This prevents the device from raising interrupts.

Now there is another twist with force threaded interrupts.

If the interrupt is level triggered the interrupt line is masked in the
hard interrupt handler and unmasked when the thread returns.

For edge type interrupts that's handled differently in order to not lose
an edge. The line is not masked, so that if the device raises an
interrupt before the threaded handler returns, the threaded handler is
marked runnable again. That again comes in two flavors:

1) Non-RT kernel

That cannot result in a scenario where the force threaded handler
loops and interrupts keep arriving at the CPU because the CPU has
interrupts disabled across the force threaded handler and at least
on x86 and arm64 that's the CPU which is target of the hardware
interrupt.

So at max there can come a few interrupts between the first
interrupt and the threaded handler being scheduled in, but I doubt
it can be more than three.

2) RT kernel

The forced threaded handler runs with CPU interrupts enabled, so
when the threaded handler deals with the device, new interrupts can
come in at the hardware level and are accounted for, which is what
you are trying to address, right?

So the total amount of TX bytes, i.e. PASS_LIMIT * tx_loadsz,
becomes large enough to trigger the irq storm detector, right?

That obviously begs two questions:

1) Why do you want to deal with this interrupt flood at the storm
detector and not at the root cause of the whole thing?

It does not make any sense to take a gazillion of interrupts for
nothing and then work around the resulting detector fallout.

The obvious adhoc cure is to add this to serial8250_interrupt():

if (IS_ENABLED(CONFIG_PREEMPT_RT))
disable_irq_nosync(irq);

magic_loop();

if (IS_ENABLED(CONFIG_PREEMPT_RT))
enable_irq(irq);

Ideally we'd disable interrupt generation in the IER register
around the loop, but that's another can of worms as this can't be
done easily without holding port lock across the IER disabled
section. Also see below.


2) Is the 512 iterations loop (PASS_LIMIT) still appropriate?

That loop in serial8250_interrupt() looks pretty horrible even on a
non-RT kernel with an emulated UART.

The issue is mostly irrelevant on real hardware as the FIFO writes
are faster than the UART can empty the FIFO. So unless there are
two FIFO UARTs sharing the same interrupt line _and_ writing
large amount of data at the same time the loop will terminate after
one write cycle. There won't be an interrupt per TX byte either.

On emulated hardware this is very different because the write
causes a VMEXIT with a full round trip to user space, which
consumes the character and immediately raises the TX empty
interrupt again because there is no "slow" hardware involved.

With the default qemu FIFO depth of 16 bytes, this results in up to
512 * 16 = 8192 bytes written out with interrupts disabled for one
invocation of the interrupt handler (threaded or not)...

I just traced interrupt entry/exit in a VM and for a file with 256k
bytes written to /dev/ttyS0 this results in:

Interrupts: 35
Bytes/Interrupt: ~7490
Tmax: 104725 us
Tavg: 96778 us

So one interrupt handler invocation which writes up to 8k takes
roughly 100ms with interrupts disabled...

Now looking at the host side. For every write to the TX FIFO there
are _four_ invocations of kvm_set_irq() originating from kvm_vm_ioctl_irq_line():

CPU 36/KVM-2063 [097] ..... 1466862.728737: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0xd
CPU 36/KVM-2063 [097] ..... 1466862.728742: kvm_set_irq: gsi 4 level 0 source 0
CPU 36/KVM-2063 [097] ..... 1466862.728749: kvm_set_irq: gsi 4 level 0 source 0
CPU 36/KVM-2063 [097] ..... 1466862.728754: kvm_set_irq: gsi 4 level 1 source 0
CPU 36/KVM-2063 [097] ..... 1466862.728762: kvm_set_irq: gsi 4 level 1 source 0
CPU 36/KVM-2063 [097] ..... 1466862.728783: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0xa
CPU 36/KVM-2063 [097] ..... 1466862.728787: kvm_set_irq: gsi 4 level 0 source 0
CPU 36/KVM-2063 [097] ..... 1466862.728792: kvm_set_irq: gsi 4 level 0 source 0
CPU 36/KVM-2063 [097] ..... 1466862.728797: kvm_set_irq: gsi 4 level 1 source 0
CPU 36/KVM-2063 [097] ..... 1466862.728809: kvm_set_irq: gsi 4 level 1 source 0

No idea why this needs four ioctls and not only two, but this
clearly demonstrates that each byte written creates an edge
interrupt. No surprise that the storm detector triggers on RT. But
this also makes it clear why it's a patently bad idea to work
around this in the detector...

Now being "smart" and disabling THRI in the IER before the write
loop in serial8250_tx_chars() changes the picture to:

CPU 18/KVM-2045 [092] ..... 1466230.357478: kvm_pio: pio_write at 0x3f9 size 1 count 1 val 0x5

IER.THRI is cleared now so it's expected that nothing fiddles with
the interrupt line anymore, right? But ....

CPU 18/KVM-2045 [092] ..... 1466230.357479: kvm_set_irq: gsi 4 level 0 source 0
CPU 18/KVM-2045 [092] ..... 1466230.357481: kvm_set_irq: gsi 4 level 0 source 0
CPU 18/KVM-2045 [092] ..... 1466230.357484: kvm_pio: pio_write at 0x3f8 size 1 count 1 val 0x73
CPU 18/KVM-2045 [092] ..... 1466230.357485: kvm_set_irq: gsi 4 level 0 source 0
CPU 18/KVM-2045 [092] ..... 1466230.357487: kvm_set_irq: gsi 4 level 0 source 0
CPU 18/KVM-2045 [092] ..... 1466230.357489: kvm_set_irq: gsi 4 level 0 source 0
CPU 18/KVM-2045 [092] ..... 1466230.357491: kvm_set_irq: gsi 4 level 0 source 0
....

So every write to the TX FIFO sets the level to 0 four times in a
row to not create an edge. That's truly impressive!

Thanks,

tglx