Re: [PATCH][RFC] x86/irq: Spread vectors on different CPUs

From: Thomas Gleixner
Date: Wed Jun 28 2017 - 15:03:52 EST


On Sat, 13 May 2017, Chen Yu wrote:
> This is because:
> 1. One of the drivers has declare many vector resource via
> pci_enable_msix_range(), say, this driver might likely want
> to reserve 6 per logical CPU, then there would be 192 of them.

This has been solved with the managed interrupt mechanism for multi queue
devices, which does not longer migrate these interrupts. It simply shuts
them down and restarts them when the cpu comes back online.

> 2. Besides, most of the vectors for this driver are allocated
> on CPU0 due to the current code strategy, so there would be
> insufficient slots left on CPU0 to receive any migrated
> IRQs from the other CPUs during CPU offine.

That's a driver sillyness. I assume it's a multi queue device, so it should
use the managed interrupt affinity code.

> 3. Furthermore, many vectors this driver reserved do no have
> any IRQ handler attached.

That's silly. Why does a driver allocate more resources than required? Just
because it can?

> As a result, all vectors on CPU0 were used out and the last alive
> CPU (31) failed to migrate its IRQs to the CPU0.
>
> As we might have difficulty to reduce the number of vectors reserved
> by that driver,

Why so? If the vectors are not used, then why are they allocated in the
first place? Can you point at the driver source please?

> there could be a compromising solution that, to spread
> the vector allocation on different CPUs rather than always choosing
> the *first* CPU in the cpumask. In this way, there would be a balanced
> vector distribution. Because many vectors reserved but without used(point 3
> above) will not be counted in during CPU offline, and they are now
> on nonboot CPUs this problem will be solved.

You are tackling the problem at the wrong end to some extent. I'm not
against sanely spreading interrupts, but we only want to do that, if there
is a compelling reason. A broken driver does not count.

> +static int pick_leisure_cpu(const struct cpumask *mask)
> +{
> + int cpu, vector;
> + int min_nr_vector = NR_VECTORS;
> + int target_cpu = cpumask_first_and(mask, cpu_online_mask);
> +
> + for_each_cpu_and(cpu, mask, cpu_online_mask) {
> + int nr_vectors = 0;
> +
> + for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
> + if (!IS_ERR_OR_NULL(per_cpu(vector_irq, cpu)[vector]))
> + nr_vectors++;
> + }
> + if (nr_vectors < min_nr_vector) {
> + min_nr_vector = nr_vectors;
> + target_cpu = cpu;
> + }

That's beyond silly. Why would we have to count the available vectors over
and over? We can keep track of the vectors available per cpu continuously
when we allocate and free them.

> + }
> + return target_cpu;
> +}
> +
> static int __assign_irq_vector(int irq, struct apic_chip_data *d,
> const struct cpumask *mask)
> {
> @@ -131,7 +152,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
> /* Only try and allocate irqs on cpus that are present */
> cpumask_clear(d->old_domain);
> cpumask_clear(searched_cpumask);
> - cpu = cpumask_first_and(mask, cpu_online_mask);
> + cpu = pick_leisure_cpu(mask);
> while (cpu < nr_cpu_ids) {
> int new_cpu, offset;

I'm really not fond of this extra loop. That function is a huge nested loop
mess already. The cpu we select upfront gets fed into
apic->vector_allocation_domain() which might just say no because the CPU
does not belong to it. So yes, it kinda works for your purpose, but it's
just a bolted on 'solution'.

We really need to sit down and rewrite that allocation code from scratch,
use per cpu bitmaps, so we can do fast search over masks and keep track of
the vectors which are used/free per cpu.

That needs some thought because we need to respect the allocation domains,
but there must be a saner way than looping with interrupts disabled for a
gazillion hoops.

What the current code misses completely is to set a preference for the node
on which the interrupt was allocated. We just blindly take something out of
the default affinity mask which is all online cpus.

So there are a lot of shortcomings in the current implementation and we
should be able to replace it with something better, faster and while at it
make that spreading stuff work.

That does not mean, that we blindly support drivers allocating a gazillion
vectors just for nothing.

Thanks,

tglx