Re: [RFC PATCH v2 00/10] irqchip/irq-gic: Optimize masking by leveraging EOImode=1

From: Marc Zyngier
Date: Thu May 27 2021 - 07:17:44 EST


On Tue, 25 May 2021 18:32:45 +0100,
Valentin Schneider <valentin.schneider@xxxxxxx> wrote:
>
> Hi folks!
>
> This is the spiritual successor to [1], which was over 6 years ago (!).
>
> Revisions
> =========
>
> RFCv1 -> RFCv2
> ++++++++++++++
>
> o Rebased against latest tip/irq/core
> o Applied cleanups suggested by Thomas
>
> o Collected some performance results
>
> Background
> ==========
>
> GIC mechanics
> +++++++++++++
>
> There are three IRQ operations:
> o Acknowledge. This gives us the IRQ number that interrupted us, and also
> - raises the running priority of the CPU interface to that of the IRQ
> - sets the active bit of the IRQ
> o Priority Drop. This "clears" the running priority.
> o Deactivate. This clears the active bit of the IRQ.
>
> o The CPU interface has a running priority value. No interrupt of lower or
> equal priority will be signaled to the CPU attached to that interface. On
> Linux, we only have two priority values: pNMIs at highest priority, and
> everything else at the other priority.
> o Most GIC interrupts have an "active" bit. This bit is set on Acknowledge
> and cleared on Deactivate. A given interrupt cannot be re-signaled to a
> CPU if it has its active bit set (i.e. if it "fires" again while it's
> being handled).
>
> EOImode fun
> +++++++++++
>
> In EOImode=0, Priority Drop and Deactivate are undissociable. The
> (simplified) interrupt handling flow is as follows:
>
> <~IRQ>
> Acknowledge
> Priority Drop + Deactivate
> <interrupts can once again be signaled, once interrupts are re-enabled>
>
> With EOImode=1, we can invoke each operation individually. This gives us:
>
> <~IRQ>
> Acknowledge
> Priority Drop
> <*other* interrupts can be signaled from here, once interrupts are re-enabled>
> Deactivate
> <*this* interrupt can be signaled again>
>
> What this means is that with EOImode=1, any interrupt is kept "masked" by
> its active bit between Priority Drop and Deactivate.
>
> Threaded IRQs and ONESHOT
> =========================
>
> ONESHOT threaded IRQs must remain masked between the main handler and the
> threaded handler. Right now we do this using the conventional irq_mask()
> operations, which looks like this:
>
> <irq handler>
> Acknowledge
> Priority Drop
> irq_mask()
> Deactivate
>
> <threaded handler>
> irq_unmask()
>
> However, masking for the GICs means poking the distributor, and there's no
> sysreg for that - it's an MMIO access. We've seen above that our IRQ
> handling can give us masking "for free", and this is what this patch set is
> all about. It turns the above handling into:
>
> <irq handler>
> Acknowledge
> Priority Drop
>
> <threaded handler>
> Deactivate
>
> No irq_mask() => fewer MMIO accesses => happier users (or so I've been
> told). This is especially relevant to PREEMPT_RT which forces threaded
> IRQs.
>
> Functional testing
> ==================
>
> GICv2
> +++++
>
> I've tested this on my Juno with forced irqthreads. This makes the pl011
> IRQ into a threaded ONESHOT IRQ, so I spammed my keyboard into the console
> and verified via ftrace that there were no irq_mask() / irq_unmask()
> involved.
>
> GICv3
> +++++
>
> I've tested this on my Ampere eMAG, which uncovered "fun" interactions with
> the MSI domains. Did the same trick as the Juno with the pl011.
>
> pNMIs cause said eMAG to freeze, but that's true even without my patches. I
> did try them out under QEMU+KVM and that looked fine, although that means I
> only got to test EOImode=0. I'll try to dig into this when I get some more
> cycles.

That's interesting/worrying. As far as I remember, this machine uses
GIC500, which is a well known quantity. If pNMIs are causing issues,
that'd probably be a CPU interface problem. Can you elaborate on how
you tried to test that part? Just using the below benchmark?

>
> Performance impact
> ==================
>
> Benchmark
> +++++++++
>
> Finding a benchmark that leverages a force-threaded IRQ has proved to be
> somewhat of a pain, so I crafted my own. It's a bit daft, but so are most
> benchmarks (though this one might win a prize).

I love it (and wrote similar hacks in my time)! :D Can you put that up
somewhere so that I can run the same test on my own zoo and find out
how it fares?

>
> Long story short, I'm picking an unused IRQ and have it be
> force-threaded. The benchmark then is:
>
> <bench thread>
> loop:
> irq_set_irqchip_state(irq, IRQCHIP_STATE_PENDING, true);
> wait_for_completion(&done);
>
> <threaded handler>
> complete(&done);
>
> A more complete picture would be:
>
> <bench thread> <whatever is on CPU0> <IRQ thread>
> raise IRQ
> wait
> run flow handler
> wake IRQ thread
> finish handling
> wake bench thread
>
> Letting this run for a fixed amount of time lets me measure an entire IRQ
> handling cycle, which is what I'm after since there's one less mask() in
> the flow handler and one less unmask() in the threaded handler.
>
> You'll note there's some potential "noise" in there due to scheduling both
> the benchmark thread and the IRQ thread. However, the IRQ thread is pinned
> to the IRQ's affinity, and I also pinned the benchmark thread in my tests,
> which should keep this noise to a minimum.
>
> Results
> +++++++
>
> On a Juno r0, 20 iterations of 5 seconds of that benchmark yields
> (measuring irqs/sec):
>
> | mean | median | 90th percentile | 99th percentile |
> |------+--------+-----------------+-----------------|
> | +11% | +11% | +12% | +14% |
>
> On an Ampere eMAG, 20 iterations of 5 seconds of that benchmark yields
> (measuring irqs/sec):
>
> | mean | median | 90th percentile | 99th percentile |
> |------+--------+-----------------+-----------------|
> | +20% | +20% | +20% | +20% |
>
> This is still quite "artificial", but it reassures me in that skipping those
> (un)mask operations can yield some measurable improvement.

20% improvement is even higher than I suspected!

Thanks,

M.

--
Without deviation from the norm, progress is not possible.