Re: [PATCH] sysfs: add per pci device msi[x] irq listing (v3)

From: Matthew Wilcox
Date: Thu Sep 22 2011 - 09:54:33 EST


On Mon, Sep 19, 2011 at 11:47:15AM -0400, Neil Horman wrote:
> So a while back, I wanted to provide a way for irqbalance (and other apps) to
> definitively map irqs to devices, which, for msi[x] irqs is currently not really
> possible in user space. My first attempt didn't go so well:
> https://lkml.org/lkml/2011/4/21/308
>
> It was plagued by the same issues as prior attempts, namely that it
> violated the one-file-one-value sysfs rule. I wandered off but have recently
> come back to this. I've got a new implementation here that exports a new
> subdirectory for every pci device, called msi_irqs. This subdirectory contanis
> a variable number of numbered subdirectories, in which the number represents an
> msi irq. Each numbered subdirectory contains attributes for that irq, which
> currently is only the mode it is operating in (msi vs. msix). I think fits
> within the constraints sysfs requires, and will allow irqbalance to properly map
> msi irqs to devices without having to rely on rickety, best guess methods like
> interface name matching.

This approach feels like building bigger rockets instead of a space
elevator :-)

What we need is to allow device drivers to ask for per-CPU interrupts,
and implement them in terms of MSI-X. I've made a couple of stabs at
implementing this, but haven't got anything working yet. It would solve
a number of problems:

1. NUMA cacheline fetch. At the moment, desc->istate gets modified by
handle_edge_irq. handle_percpu_irq doesn't need to worry about any
of that stuff, so doesn't touch desc->istate. I've heard this is a
significant problem for the high-speed networking people.

2. /proc/interrupts is unmanageable on large machines. There are hundreds
of interrupts and dozens of CPUs. This would go a long way to reducing
the number of rows in the table (doesn't do anything about the columns).

i.e. instead of this:

79: 0 0 0 0 0 0 0 0 PCI-MSI-edge eth1
80: 0 0 9275611 0 0 0 0 0 PCI-MSI-edge eth1-TxRx-0
81: 0 0 9275611 0 0 0 0 0 PCI-MSI-edge eth1-TxRx-1
82: 0 0 0 0 9275611 0 0 0 PCI-MSI-edge eth1-TxRx-2
83: 0 0 0 0 9275611 0 0 0 PCI-MSI-edge eth1-TxRx-3
84: 0 0 0 0 0 9275611 0 0 PCI-MSI-edge eth1-TxRx-4
85: 0 0 0 0 0 9275611 0 0 PCI-MSI-edge eth1-TxRx-5
86: 0 0 0 0 0 0 9275611 0 PCI-MSI-edge eth1-TxRx-6
87: 0 0 0 0 0 0 9275611 0 PCI-MSI-edge eth1-TxRx-7

We'd get this:

79: 0 0 0 0 0 0 0 0 PCI-MSI-edge eth1
80: 9275611 9275611 9275611 9275611 9275611 9275611 9275611 9275611 PCI-MSI-edge eth1-TxRx

3. /proc/irq/x/smp_affinity actually makes sense again. It can be a
mask of the CPUs on which the interrupt is active, instead of being a
degenerate case in which only the lowest set bit is actually honoured.

4. Easier to manage for the device driver. All it needs to do is call
request_percpu_irq(...) instead of trying to figure out how many
threads/cores/numa nodes/... there are in the machine, and how many
other multi-interrupt devices there are; and thus how many interrupts
it should allocate. That can be left to the interrupt core which at
least has a chance of getting it right.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."