Re: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU

From: Thomas Gleixner
Date: Thu Sep 07 2017 - 05:45:31 EST


On Thu, 7 Sep 2017, Yu Chen wrote:
> On Thu, Sep 07, 2017 at 07:54:09AM +0200, Thomas Gleixner wrote:
> > Please switch it over to managed interrupts so the affinity spreading
> > happens in a sane way and the interrupts are properly managed on CPU
> > hotplug.
> Ok, I think currently in i40e driver the reservation of vectors
> leverages pci_enable_msix_range() and did not provide the affinity
> hint to low level IRQ system thus the managed interrupts is not enabled
> there (although later in i40e driver we use irq_set_affinity_hint() to
> spread the IRQs)

The affinity hint has nothing to do with that. It's a hint which tells the
user space irqbalance daemon what the desired placement of the interrupt
should be. That was never used for spreading the affinity automatically in
the kernel and will never be used to do so. It was a design failure from
the very beginning and should be eliminated ASAP.

The general problem here is the way the whole MSI(X) machinery works in
the kernel.

    pci_enable_msix()
      allocate_interrupts()
      allocate_irqdescs()
      allocate_resources()
      allocate_DMAR_entries()
      allocate_vectors()
      initialize_MSI_entries()

The reason for this is historical. Drivers expect that request_irq()
works once they have allocated the required resources upfront.

Of course this could be changed, but there are issues with that:

1) The driver must ensure that it does not enable any of the internal
   interrupt delivery mechanisms in the device before request_irq() has
   succeeded.

   That needs auditing drivers all over the place, or we just ignore that
   and leave everyone puzzled why things suddenly stop working.

2) Reservation accounting

   When no vectors are allocated, we still need to make reservations so
   we can tell a driver that the vector space is exhausted when it
   invokes pci_enable_msix(). But how do we size the reservation space?
   Based on nr_possible_cpus(), nr_online_cpus() or some other
   heuristics?

   Sure, we can just ignore that and resort to overcommitment and fail
   request_irq() when resources are not available, which brings us back
   to #1.
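
To make the sizing question in 2) concrete, here is a userspace toy model
of reservation accounting. It is an illustration only, not kernel code:
every toy_* name and the figure of 200 vectors per CPU are assumptions
made up for this sketch. The CPU count passed to toy_space_init() stands
for whichever sizing policy is picked (online CPUs, possible CPUs or some
heuristic).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical figure: usable device vectors per CPU in this toy model. */
#define TOY_VECTORS_PER_CPU 200u

struct toy_vector_space {
	unsigned int capacity;	/* total vectors the sizing policy admits */
	unsigned int reserved;	/* promised to drivers, not yet allocated */
};

/* Size the space for a given CPU count (online, possible or a heuristic). */
static void toy_space_init(struct toy_vector_space *vs, unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	vs->capacity = nr_cpus * TOY_VECTORS_PER_CPU;
	vs->reserved = 0;
}

/*
 * Stand-in for the accounting step at pci_enable_msix() time: pure
 * bookkeeping, no vector is assigned to any CPU yet.
 */
static int toy_reserve(struct toy_vector_space *vs, unsigned int nvec)
{
	if (nvec > vs->capacity - vs->reserved)
		return -ENOSPC;	/* tell the driver the space is exhausted */
	vs->reserved += nvec;
	return 0;
}
```

With the space sized for two CPUs, a request for 300 vectors is accepted
and a following request for 101 is refused. Sized for more CPUs, the same
requests would both be accepted, which is exactly why the choice of sizing
policy matters.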

But resorting to overcommitment does not make the CPU hotplug problem
magically go away. If queues and interrupts are used, then the non-managed
variants are going to break affinities and move stuff to the still online
CPUs, which is going to fail.

Managed irqs just work because the driver stops the queue and the interrupt
(which can even stay requested) is shut down and 'kept' on the outgoing
CPU. If the CPU comes back then the vector is reestablished and the
interrupt started up on the fly. Stuff just works.....
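
The difference between the two variants on CPU hotplug can be sketched
with a userspace toy model. Again this is an illustration only, not kernel
code: all toy_* names are made up, each interrupt targets exactly one CPU
(real affinity masks can span several), and vector exhaustion on the
remaining CPUs is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

struct toy_irq {
	int  cpu;	/* CPU the interrupt targets */
	bool managed;	/* affinity owned by the core code */
	bool shutdown;	/* managed irq parked while its CPU is offline */
};

static bool toy_cpu_online[TOY_NR_CPUS] = { true, true, true, true };

/* Returns the lowest-numbered online CPU, or -1 if none is online. */
static int toy_first_online_cpu(void)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_cpu_online[cpu])
			return cpu;
	return -1;
}

/* CPU goes offline: managed irqs are parked in place, others are moved. */
static void toy_cpu_down(struct toy_irq *irqs, int n, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_cpu_online[cpu] = false;
	for (int i = 0; i < n; i++) {
		if (irqs[i].cpu != cpu)
			continue;
		if (irqs[i].managed)
			irqs[i].shutdown = true;	/* 'kept' on the outgoing CPU */
		else
			irqs[i].cpu = toy_first_online_cpu();	/* affinity broken */
	}
}

/* CPU comes back: parked managed irqs are started up again. */
static void toy_cpu_up(struct toy_irq *irqs, int n, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_cpu_online[cpu] = true;
	for (int i = 0; i < n; i++)
		if (irqs[i].managed && irqs[i].cpu == cpu)
			irqs[i].shutdown = false;
}
```

Taking CPU 3 down and up again leaves a managed interrupt on CPU 3 with
its original placement, while a non-managed one has been moved to CPU 0
and stays there, adding to the vector pressure on the remaining CPUs.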

Thanks,

tglx