Re: [patch v2 12/14] [RFC] genirq/proc: Provide binary statistic interface

From: Dmitry Ilvokhin

Date: Wed Apr 01 2026 - 13:09:36 EST


On Fri, Mar 20, 2026 at 02:22:24PM +0100, Thomas Gleixner wrote:
> /proc/interrupts is expensive to evaluate for monitoring because:
>
> - it is text based and contains a lot of information which is not
> relevant for interrupt frequency analysis. Due to the extra information
> like chip name, hardware interrupt number, interrupt action names, it
> has to take the interrupt descriptor lock to output those items into
> the seq_file buffer. That obviously interferes with high frequency
> interrupt workloads.
>
> - it contains both device interrupts, per CPU and architecture specific
> interrupt counters without being able to look at them separately. The
> file is seekable by some definition of seekable as the position can
> change when interrupts are requested or freed, so the data has to be
> read completely to get a coherent picture.
>
> - it emits records for requested interrupts even if their interrupt count
> is zero.
>
> - it always prints the per CPU counters even if all but one of them are
> zero.
>
> - converting numbers to text and then parsing the text back to numbers in
> user space is a pretty wasteful exercise
>
> Provide a new interface which addresses the above pain points:
>
> 1) The interface is binary and emits variable length records per
> interrupt. Each record starts with a header containing the interrupt
> number and the number of data entries following the header. The data
> entries consist of a CPU number and count pair.
>
> 2) Interrupts with a total count of zero are skipped and produce no
> output at all.
>
> 3) Interrupts which have a single CPU affinity either due to a restricted
> affinity mask or due to the underlying interrupt chip restricting a
> mask to a single CPU target emit only one data entry.
>
> That means they are not emitting the stale counts on previous target
> CPUs but they are not really interesting for interrupt frequency
> analysis as they are not changing and therefore pointless for
> accounting.
>
> 4) The interface separates device interrupts, per CPU interrupts and
> architecture specific interrupts.
>
> Per CPU and architecture specific interrupts can only be monitored,
> while device interrupts can also be steered by changing the affinity
> unless they are affinity managed by the kernel.
>
> Per CPU interrupts are only available on architectures, e.g. ARM64,
> which use the regular interrupt descriptor mechanism for per CPU
> interrupt handling.
>
> Architectures which have their own mechanics, e.g. x86, do not enable
> and provide the per CPU interface as those interrupts are covered by
> the architecture specific accounting.
>
> 5) The readout is fully lockless so it does not interfere with concurrent
> interrupt handling.
>
> 6) Seek is restricted to seek(fd, 0, SEEK_SET) as that's the only
> operation which makes sense due to the variable record length and the
> dynamics of interrupt request/free operations which influence the
> position of the records in the output. For all other seek()
> invocations return the current file position, which makes e.g. python
> happy as an error code causes the file open checks to mark the
> resulting file object non-seekable.
>
> Implement support for /proc/irq/device_stats and /proc/irq/percpu_stats.
>
> The support for architecture specific interrupt statistics is added in a
> separate step.
>
> Reading /proc/irq/device_stats on a 256 CPU x86 machine with 83 requested
> interrupts produces 13 records due to skipping zero count interrupts. It
> results in 13 * 16 = 208 bytes of data as all device interrupts on x86 are
> single CPU targeted. That readout takes ~8us time in the kernel, while the
> full /proc/interrupts readout takes about 360us.
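
For reference, the record layout from 1) and the 13 * 16 = 208 arithmetic
can be sketched from the userspace side as below. This is a minimal
sketch, not code from the patch: stat_hdr, stat_cpu and walk_records()
are made-up names, uint32_t stands in for the kernel field type, and the
buffer is assumed to have been filled by a read() of
/proc/irq/device_stats.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace mirrors of the proposed UAPI structs */
struct stat_cpu { uint32_t cpu; uint32_t cnt; };
struct stat_hdr { uint32_t irqnr; uint32_t entries; };

/* One header plus one entry: the 16 bytes per single-CPU record */
static_assert(sizeof(struct stat_hdr) + sizeof(struct stat_cpu) == 16,
	      "unexpected record size");

/*
 * Walk a buffer of variable length records and print the total count
 * of each. Returns the number of records, or -1 if the buffer ends in
 * the middle of a record.
 */
static int walk_records(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int nrec = 0;

	while (off < len) {
		struct stat_hdr h;
		uint64_t total = 0;

		if (len - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));
		off += sizeof(h);

		/* Make sure all advertised entries fit before reading them */
		if ((len - off) / sizeof(struct stat_cpu) < h.entries)
			return -1;

		for (uint32_t i = 0; i < h.entries; i++) {
			struct stat_cpu c;

			memcpy(&c, buf + off, sizeof(c));
			off += sizeof(c);
			total += c.cnt;
		}
		printf("irq %u: %u entries, total %llu\n", h.irqnr,
		       h.entries, (unsigned long long)total);
		nrec++;
	}
	return nrec;
}
```

A record for a single-CPU targeted interrupt is one 8 byte header plus
one 8 byte entry, which is where the 16 bytes per record come from.
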
>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxx>
> ---
> include/uapi/linux/irqstats.h | 27 +++
> kernel/irq/Kconfig | 3
> kernel/irq/proc.c | 314 ++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 344 insertions(+)
>
> --- /dev/null
> +++ b/include/uapi/linux/irqstats.h
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
> +#ifndef LINUX_UAPI_IRQSTATS_H
> +#define LINUX_UAPI_IRQSTATS_H
> +
> +/**
> + * irq_proc_stat_cpu - Data record for /proc/irq/stats
> + * @cpu: The CPU associated to @cnt
> + * @cnt: The count assiciated to @cpu

nit: s/assiciated/associated/

> + */
> +struct irq_proc_stat_cpu {
> + unsigned int cpu;
> + unsigned int cnt;
> +};

nit: UAPI structs should use __u32 instead of unsigned int.

> +
> +/**
> + * irq_proc_stat_data - Data header for /proc/irq/stats
> + * @irqnr: The interrupt number
> + * @entries: The number of records (max. nr_cpu_ids)
> + * @pcpu: Runtime sized array of per CPU stat records
> + */
> +struct irq_proc_stat_data {
> + unsigned int irqnr;
> + unsigned int entries;
> + struct irq_proc_stat_cpu pcpu[];
> +};

Same here: irqnr and entries should be __u32.

Also, this struct has no extensibility mechanism. If irq_proc_stat_cpu
ever needs a new field, userspace has no way to detect the layout
change.

A __u32 entry_size field in the header, set to
sizeof(struct irq_proc_stat_cpu), would let userspace stride through
the entries safely even if the struct grows later.
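
A rough sketch of how userspace could consume such a header, assuming
entry_size is simply appended to the existing two fields. The names
stat_hdr_v2, stat_cpu and sum_record() are made up for illustration and
uint32_t stands in for __u32:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header with the suggested entry_size field appended */
struct stat_hdr_v2 {
	uint32_t irqnr;
	uint32_t entries;
	uint32_t entry_size;	/* size of one entry as emitted by the kernel */
};

/* The entry layout this userspace binary was built against */
struct stat_cpu { uint32_t cpu; uint32_t cnt; };

static_assert(sizeof(struct stat_hdr_v2) == 12, "unexpected header padding");

/*
 * Sum the counts of the record at @rec. The stride comes from the
 * header and not from sizeof(struct stat_cpu), so fields appended by
 * a newer kernel are skipped. Returns 0 on success, -1 if the record
 * is malformed or truncated.
 */
static int sum_record(const unsigned char *rec, size_t len, uint64_t *sum)
{
	struct stat_hdr_v2 h;
	const unsigned char *p;

	if (len < sizeof(h))
		return -1;
	memcpy(&h, rec, sizeof(h));

	/* An entry smaller than the known layout cannot be parsed */
	if (h.entry_size < sizeof(struct stat_cpu))
		return -1;
	if ((len - sizeof(h)) / h.entry_size < h.entries)
		return -1;

	p = rec + sizeof(h);
	*sum = 0;
	for (uint32_t i = 0; i < h.entries; i++, p += h.entry_size) {
		struct stat_cpu c;

		/* Copy only the known prefix of each entry */
		memcpy(&c, p, sizeof(c));
		*sum += c.cnt;
	}
	return 0;
}
```

With that, a binary built against the 8 byte entry keeps working when
the kernel starts emitting larger entries, as long as new fields are
only ever appended.
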

> +
> +#endif
> --- a/kernel/irq/Kconfig
> +++ b/kernel/irq/Kconfig
> @@ -18,6 +18,9 @@ config GENERIC_IRQ_SHOW
> config GENERIC_IRQ_SHOW_LEVEL
> bool
>
> +config GENERIC_IRQ_STATS_PERCPU
> + bool
> +

[...]

> +static bool irq_stat_update_one(struct irq_proc_stat *s)
> +{
> + struct irq_proc_stat_data *d = s->data;
> +
> + if (IS_ENABLED(CONFIG_GENERIC_IRQ_PERCPU_STATS) && s->percpu)
> + irq_percpu_stat_update_one(s);

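This should be CONFIG_GENERIC_IRQ_STATS_PERCPU, matching the Kconfig
symbol added above; PERCPU and STATS are swapped. IS_ENABLED() on an
undefined symbol evaluates to 0 rather than failing the build, so as
written the per CPU update is never called.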
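
To illustrate why the build does not catch this, here is a userspace
sketch with a simplified copy of the IS_ENABLED() machinery from
include/linux/kconfig.h (only the built-in case, the =m handling is left
out). The two helper functions are made up for the demonstration, and
the #define stands in for an architecture selecting the symbol:

```c
#include <assert.h>

/* Simplified from include/linux/kconfig.h, built-in case only */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define __is_defined(x)			___is_defined(x)
#define ___is_defined(val)		____is_defined(__ARG_PLACEHOLDER_##val)
#define ____is_defined(arg1_or_junk)	__take_second_arg(arg1_or_junk 1, 0)
#define IS_ENABLED(option)		__is_defined(option)

/* Stand-in for what Kconfig generates when the symbol is selected */
#define CONFIG_GENERIC_IRQ_STATS_PERCPU 1

/* The misspelled symbol is never defined, so this is constant 0 */
static_assert(IS_ENABLED(CONFIG_GENERIC_IRQ_PERCPU_STATS) == 0,
	      "undefined symbol must evaluate to 0");
static_assert(IS_ENABLED(CONFIG_GENERIC_IRQ_STATS_PERCPU) == 1,
	      "defined symbol must evaluate to 1");

/* Hypothetical helper mirroring the condition as quoted */
static int percpu_path_taken(int percpu)
{
	return IS_ENABLED(CONFIG_GENERIC_IRQ_PERCPU_STATS) && percpu;
}

/* Hypothetical helper with the symbol name corrected */
static int percpu_path_taken_fixed(int percpu)
{
	return IS_ENABLED(CONFIG_GENERIC_IRQ_STATS_PERCPU) && percpu;
}
```

The misspelled variant compiles cleanly and returns 0 for every input,
which is the same silent dead branch the quoted hunk would end up with.
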