Re: [PATCH][RFC] x86/irq: Spread vectors on different CPUs

From: Chen Yu
Date: Wed Jul 05 2017 - 10:02:14 EST


Hi Thomas,
On Wed, Jun 28, 2017 at 09:03:19PM +0200, Thomas Gleixner wrote:
> On Sat, 13 May 2017, Chen Yu wrote:
> > This is because:
> > 1. One of the drivers has declared many vector resources via
> > pci_enable_msix_range(); say the driver wants to reserve 6 per
> > logical CPU, then with 32 logical CPUs there would be 192 of them.
>
> This has been solved with the managed interrupt mechanism for multi queue
> devices, which no longer migrates these interrupts. It simply shuts
> them down and restarts them when the cpu comes back online.
>
I think the driver has already enabled multi-queue:
ethtool -l enp187s0f0
Channel parameters for enp187s0f0:
Pre-set maximums:
RX: 0
TX: 0
Other: 1
Combined: 128
Current hardware settings:
RX: 0
TX: 0
Other: 1
Combined: 32
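
To make sure I understand the managed path you refer to: I assume it is
entered when the driver allocates its vectors with PCI_IRQ_AFFINITY,
roughly as in the sketch below (this is not i40e code, the handler and
names are made up for illustration). I will check whether i40e actually
goes through this path.

#include <linux/pci.h>
#include <linux/interrupt.h>

static irqreturn_t example_queue_handler(int irq, void *data)
{
	return IRQ_HANDLED;
}

static int example_setup_queue_irqs(struct pci_dev *pdev, int nr_queues)
{
	int nvec, i, ret;

	/*
	 * PCI_IRQ_AFFINITY asks the core to spread the vectors across the
	 * CPUs and manage them: on CPU offline they are shut down instead
	 * of being migrated to the remaining CPUs.
	 */
	nvec = pci_alloc_irq_vectors(pdev, 1, nr_queues,
				     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
	if (nvec < 0)
		return nvec;

	for (i = 0; i < nvec; i++) {
		ret = request_irq(pci_irq_vector(pdev, i),
				  example_queue_handler, 0,
				  "example-queue", pdev);
		if (ret)
			goto out_free;
	}
	return 0;

out_free:
	while (--i >= 0)
		free_irq(pci_irq_vector(pdev, i), pdev);
	pci_free_irq_vectors(pdev);
	return ret;
}
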
> > 2. Besides, most of the vectors for this driver are allocated
> > on CPU0 due to the current code strategy, so there would be
> > insufficient slots left on CPU0 to receive any migrated
> > IRQs from the other CPUs during CPU offline.
>
> That's a driver sillyness. I assume it's a multi queue device, so it should
> use the managed interrupt affinity code.
>
Let me Cc my colleagues Mitch and Harshitha, who are currently working
on this driver.
> > 3. Furthermore, many vectors this driver reserved do not have
> > any IRQ handler attached.
>
> That's silly. Why does a driver allocate more resources than required? Just
> because it can?
>
The reason the driver reserves these resources without using them might
be (this is just my guess) that no network cable is plugged into the
interface at the moment, even though the driver has been loaded. I will
confirm whether this is the case first.
> > As a result, all vectors on CPU0 were used up and the last alive
> > CPU (31) failed to migrate its IRQs to CPU0.
> >
> > As we might have difficulty reducing the number of vectors reserved
> > by that driver,
>
> Why so? If the vectors are not used, then why are they allocated in the
> first place? Can you point at the driver source please?
>
We are testing the net/ethernet/intel/i40e driver now. Previously we
introduced a mechanism to release some vectors before S4 to work around
this issue, but as you mentioned, if these are managed multi-queue
interrupts, the release is accomplished automatically during CPU hotplug.
I wonder whether the reserved vectors will also be released even when no
handler is attached?
> > there could be a compromise solution: spread the vector allocation
> > across different CPUs rather than always choosing the *first* CPU in
> > the cpumask. In this way there would be a balanced vector
> > distribution. Because the many vectors that are reserved but unused
> > (point 3 above) are not counted during CPU offline, and they would
> > now live on nonboot CPUs, this problem would be solved.
>
> You are tackling the problem at the wrong end to some extent. I'm not
> against sanely spreading interrupts, but we only want to do that, if there
> is a compelling reason. A broken driver does not count.
>
Understood.
> > +static int pick_leisure_cpu(const struct cpumask *mask)
> > +{
> > + int cpu, vector;
> > + int min_nr_vector = NR_VECTORS;
> > + int target_cpu = cpumask_first_and(mask, cpu_online_mask);
> > +
> > + for_each_cpu_and(cpu, mask, cpu_online_mask) {
> > + int nr_vectors = 0;
> > +
> > + for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
> > + if (!IS_ERR_OR_NULL(per_cpu(vector_irq, cpu)[vector]))
> > + nr_vectors++;
> > + }
> > + if (nr_vectors < min_nr_vector) {
> > + min_nr_vector = nr_vectors;
> > + target_cpu = cpu;
> > + }
>
> That's beyond silly. Why would we have to count the available vectors over
> and over? We can keep track of the vectors available per cpu continuously
> when we allocate and free them.
>
Yes, that would be much more efficient. :)
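If I read your suggestion correctly, it could look roughly like the
sketch below (only a sketch against the current vector_irq layout, and
the helper names are mine):

static DEFINE_PER_CPU(unsigned int, vectors_used);

/* to be called wherever per_cpu(vector_irq, cpu)[vector] is assigned */
static inline void account_vector_alloc(int cpu)
{
	per_cpu(vectors_used, cpu)++;
}

/* to be called wherever per_cpu(vector_irq, cpu)[vector] is released */
static inline void account_vector_free(int cpu)
{
	per_cpu(vectors_used, cpu)--;
}

static int pick_leisure_cpu(const struct cpumask *mask)
{
	unsigned int min_used = UINT_MAX;
	int cpu, target_cpu = cpumask_first_and(mask, cpu_online_mask);

	/* pick the online CPU in @mask with the fewest vectors in use */
	for_each_cpu_and(cpu, mask, cpu_online_mask) {
		if (per_cpu(vectors_used, cpu) < min_used) {
			min_used = per_cpu(vectors_used, cpu);
			target_cpu = cpu;
		}
	}
	return target_cpu;
}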
> > + }
> > + return target_cpu;
> > +}
> > +
> > static int __assign_irq_vector(int irq, struct apic_chip_data *d,
> > const struct cpumask *mask)
> > {
> > @@ -131,7 +152,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
> > /* Only try and allocate irqs on cpus that are present */
> > cpumask_clear(d->old_domain);
> > cpumask_clear(searched_cpumask);
> > - cpu = cpumask_first_and(mask, cpu_online_mask);
> > + cpu = pick_leisure_cpu(mask);
> > while (cpu < nr_cpu_ids) {
> > int new_cpu, offset;
>
> I'm really not fond of this extra loop. That function is a huge nested loop
> mess already. The cpu we select upfront gets fed into
> apic->vector_allocation_domain() which might just say no because the CPU
> does not belong to it. So yes, it kinda works for your purpose, but it's
> just a bolted on 'solution'.
>
> We really need to sit down and rewrite that allocation code from scratch,
> use per cpu bitmaps, so we can do fast search over masks and keep track of
> the vectors which are used/free per cpu.
>
> That needs some thought because we need to respect the allocation domains,
> but there must be a saner way than looping with interrupts disabled for a
> gazillion hoops.
>
> What the current code misses completely is to set a preference for the node
> on which the interrupt was allocated. We just blindly take something out of
> the default affinity mask which is all online cpus.
>
> So there are a lot of shortcomings in the current implementation and we
> should be able to replace it with something better, faster and while at it
> make that spreading stuff work.
Agreed. That sounds like a good improvement; let me check the i40e driver
first and then see whether the allocation code can be revised according
to this design.
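
For the per-cpu bitmap part, is the sketch below roughly what you have in
mind? It ignores the allocation domains and locking for brevity, and the
helper names are mine.

typedef unsigned long vector_bitmap_t[BITS_TO_LONGS(NR_VECTORS)];
static DEFINE_PER_CPU(vector_bitmap_t, vector_bitmap);

/* mark a vector as used/free on @cpu when it is allocated/released */
static void mark_vector(int cpu, int vector, bool used)
{
	unsigned long *map = per_cpu(vector_bitmap, cpu);

	if (used)
		__set_bit(vector, map);
	else
		__clear_bit(vector, map);
}

/* fast search for a free vector on @cpu, -1 if the cpu is full */
static int find_free_vector(int cpu)
{
	unsigned long *map = per_cpu(vector_bitmap, cpu);
	int vector = find_next_zero_bit(map, NR_VECTORS,
					FIRST_EXTERNAL_VECTOR);

	return vector < NR_VECTORS ? vector : -1;
}

/* cheap per-cpu load metric, usable for spreading new allocations */
static unsigned int vectors_in_use(int cpu)
{
	return bitmap_weight(per_cpu(vector_bitmap, cpu), NR_VECTORS);
}
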
>
> That does not mean that we blindly support drivers allocating a gazillion
> vectors just for nothing.
>
Thanks,
Yu
> Thanks,
>
> tglx
>