Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

From: Thomas Gleixner
Date: Wed Apr 04 2018 - 15:38:37 EST


On Wed, 4 Apr 2018, Ming Lei wrote:
> On Wed, Apr 04, 2018 at 10:25:16AM +0200, Thomas Gleixner wrote:
> > In the example above:
> >
> > > > > irq 39, cpu list 0,4
> > > > > irq 40, cpu list 1,6
> > > > > irq 41, cpu list 2,5
> > > > > irq 42, cpu list 3,7
> >
> > and assumed that at driver init time only CPU 0-3 are online then the
> > hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.
>
> Indeed, and I just tested this case, and found that no interrupts are
> delivered to CPU 4-7.
>
> In theory, the affinity has been assigned to these irq vectors, and
> programmed to interrupt controller, I understand it should work.
>
> Could you explain it a bit why interrupts aren't delivered to CPU 4-7?

As I explained before:

"If the device is already in use when the offline CPUs get hot plugged, then
the interrupts still stay on cpu 0-3 because the effective affinity of
interrupts on X86 (and other architectures) is always a single CPU."

IOW, if you set the affinity mask so it contains more than one CPU, then the
kernel selects a single CPU as target. The selected CPU must be online, and
if there is more than one online CPU in the mask, then the kernel picks the
one which has the least number of interrupts targeted at it. This selected
CPU target is programmed into the corresponding interrupt chip
(IOAPIC/MSI/MSI-X ...) and it stays that way until the selected target CPU
goes offline or the affinity mask changes.
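
A rough model of that selection, as a Python sketch. The function name, the
dict of per-CPU interrupt counts and the tie-break on the lowest CPU number
are assumptions made for illustration; the real logic lives in the low level
vector allocation code and is not reproduced here.

```python
def pick_target_cpu(affinity, online, irqs_per_cpu):
    """Return the single CPU an interrupt would be routed to, or None.

    affinity     -- set of CPUs in the interrupt's affinity mask
    online       -- set of CPUs which are currently online
    irqs_per_cpu -- hypothetical bookkeeping: CPU -> interrupts targeted at it
    """
    candidates = set(affinity) & set(online)
    if not candidates:
        return None
    # Least loaded CPU wins; ties go to the lowest CPU number (assumption).
    return min(candidates, key=lambda cpu: (irqs_per_cpu.get(cpu, 0), cpu))


# Made-up load numbers; CPU 3 is in the mask but offline, so it is skipped.
load = {0: 3, 1: 1, 2: 2, 3: 0}
print(pick_target_cpu({0, 1, 2, 3}, online={0, 1, 2}, irqs_per_cpu=load))
```

With these numbers the sketch settles on CPU 1, although CPU 3 has fewer
interrupts, simply because CPU 3 is not online at selection time.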

The reasons why we use single target delivery on X86 are:

1) Not all X86 systems support multi target delivery

2) If a system supports multi target delivery then the interrupt is
   preferably delivered to the CPU with the lowest APIC ID (which
   usually corresponds to the lowest CPU number) due to hardware magic
   and only a very small percentage of interrupts are delivered to the
   other CPUs in the multi target set. So the benefit is rather dubious
   and extensive performance testing did not show any significant
   difference.

3) The management of multi targets on the software side is painful as
   the same low level vector number has to be allocated on all possible
   target CPUs. That's making a lot of things including hotplug more
   complex for very little - if at all - benefit.

So at some point we ripped out the multi target support on X86 and moved
everything to single target delivery mode.

Other architectures never supported multi target delivery, either due to
hardware restrictions or for reasons similar to those which made X86 drop
it. There might be a few architectures which support it, but I have no
overview at the moment.

The information is in procfs:

# cat /proc/irq/9/smp_affinity_list
0-3
# cat /proc/irq/9/effective_affinity_list
1

# cat /proc/irq/10/smp_affinity_list
0-3
# cat /proc/irq/10/effective_affinity_list
2

smp_affinity[_list] is the affinity which is set either by the kernel or by
writing to /proc/irq/$N/smp_affinity[_list]

effective_affinity[_list] is the affinity which is effective, i.e. the
single target CPU to which the interrupt is affine at this point.

As you can see in the above examples the target CPU is selected from the
given possible target set and the internal spreading of the low level x86
vector allocation code picks a CPU which has the lowest number of
interrupts targeted at it.
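
For scripting, the two files can be read and compared along these lines.
This is a minimal sketch: the helper names are made up, and the parser only
handles the plain "0-3" / "0,4" list forms shown above.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Turn a CPU list such as '0-3' or '0,4' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list yields an empty set
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def irq_affinity(irq, proc_root="/proc"):
    """Return (configured, effective) CPU sets for one interrupt number."""
    base = Path(proc_root) / "irq" / str(irq)
    configured = parse_cpu_list((base / "smp_affinity_list").read_text())
    effective = parse_cpu_list((base / "effective_affinity_list").read_text())
    return configured, effective


print(parse_cpu_list("0-3"))
```

On a system like the one in the example, irq_affinity(9) would be expected to
return the set 0-3 as configured affinity and a single CPU as effective
affinity.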

Let's assume for the example below

# cat /proc/irq/10/smp_affinity_list
0-3
# cat /proc/irq/10/effective_affinity_list
2

that CPU 3 was offline when the device was initialized. So there was no way
to select it and when CPU 3 comes online there is no reason to change the
affinity of that interrupt, at least not from the kernel POV. Actually we
don't even have a mechanism to do so automagically.

If I offline CPU 2 after onlining CPU 3 then the kernel has to move the
interrupt away from CPU 2, so it selects CPU 3 as it's the one with the
lowest number of interrupts targeted at it.
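
The sequence above can be replayed with a toy model. The per-CPU load
numbers are invented, and pick() is a stand-in for the kernel's target
selection, not the real code.

```python
def pick(cpus, load):
    """Least loaded CPU of the set; ties go to the lowest CPU number."""
    return min(cpus, key=lambda cpu: (load[cpu], cpu))


load = {0: 2, 1: 2, 2: 0, 3: 0}   # interrupts per CPU (made-up numbers)
affinity = {0, 1, 2, 3}           # smp_affinity_list of irq 10
online = {0, 1, 2}                # CPU 3 is offline at device init

# Device init: only online CPUs in the mask are candidates.
effective = pick(affinity & online, load)
load[effective] += 1
history = [effective]

# CPU 3 comes online: nothing triggers a retarget, the interrupt stays put.
online.add(3)
history.append(effective)

# CPU 2 goes offline: the interrupt has to move to another online CPU.
online.discard(2)
load[effective] -= 1
effective = pick(affinity & online, load)
load[effective] += 1
history.append(effective)

print(history)  # [2, 2, 3]
```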

Now this is a bit different if you use affinity managed interrupts like
NVME and other devices do.

Many of these devices create one queue per possible CPU, so the spreading
is simple: one interrupt per possible CPU. Pretty boring.

When the device has fewer queues than possible CPUs, then stuff gets more
interesting. The queues and therefore the interrupts must be targeted at
multiple CPUs. There is some logic which spreads them over the NUMA nodes
and takes siblings into account when Hyperthreading is enabled.
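
To illustrate only the basic idea of handing several CPUs to each queue,
here is a deliberately naive sketch. It splits the CPUs into contiguous,
nearly equal groups and ignores NUMA nodes and siblings entirely, so it is
not the kernel's algorithm and will not reproduce the masks quoted at the
top of this mail. The function name is made up.

```python
def spread(cpus, nr_queues):
    """Split CPUs into nr_queues contiguous groups of nearly equal size."""
    cpus = sorted(cpus)
    base, extra = divmod(len(cpus), nr_queues)
    masks, start = [], 0
    for queue in range(nr_queues):
        # The first 'extra' queues take one additional CPU each.
        size = base + (1 if queue < extra else 0)
        masks.append(set(cpus[start:start + size]))
        start += size
    return masks


for queue, mask in enumerate(spread(range(8), 4)):
    print("queue", queue, "cpus", sorted(mask))
```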

In both cases the managed interrupts are handled over CPU soft
hotplug/unplug:

1) If a CPU is soft unplugged and an interrupt is targeted at the CPU
   then the interrupt is either moved to a still online CPU in the
   affinity mask or, if the outgoing CPU is the last one in the affinity
   mask, it is shut down.

2) If a CPU is soft plugged then the interrupts are scanned and the ones
   which are managed and shut down are checked whether the affinity mask
   contains the upcoming CPU. If that's the case then the interrupt is
   started up and can deliver interrupts for the corresponding queue.

   If an interrupt is managed and already started, then nothing happens
   and the effective affinity is untouched even if the upcoming CPU is in
   the affinity set.
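
The two rules can be written down as a small state model. This is a sketch
only: the class and function names are hypothetical, "effective is None"
stands for a shut down interrupt, and picking the lowest remaining CPU on
unplug is an arbitrary choice of the sketch.

```python
class ManagedIrq:
    def __init__(self, affinity):
        self.affinity = set(affinity)
        self.effective = None          # None models "shut down"


def startup(irq, online):
    """Start the interrupt on an online CPU of its mask, if there is one."""
    candidates = irq.affinity & online
    irq.effective = min(candidates) if candidates else None


def cpu_offline(irq, cpu, online):
    """Rule 1: 'online' is the set of CPUs left after 'cpu' went away."""
    if irq.effective != cpu:
        return
    remaining = irq.affinity & online
    irq.effective = min(remaining) if remaining else None


def cpu_online(irq, cpu):
    """Rule 2: only a shut down interrupt reacts to an upcoming CPU."""
    if irq.effective is None and cpu in irq.affinity:
        irq.effective = cpu


online = set(range(8))
irq = ManagedIrq({1, 6})
startup(irq, online)
states = [irq.effective]

for cpu in (1, 6):                     # unplug CPU 1, then CPU 6
    online.discard(cpu)
    cpu_offline(irq, cpu, online)
    states.append(irq.effective)

for cpu in (1, 6):                     # plug CPU 1, then CPU 6
    online.add(cpu)
    cpu_online(irq, cpu)
    states.append(irq.effective)

print(states)  # [1, 6, None, 1, 1]
```

The last step shows the point made above: when CPU 6 comes back, the
interrupt is already started and stays on CPU 1.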

Let's briefly talk about the three CPU masks:

1) cpus_possible_mask:

   The CPUs which are possible on a system.

2) cpus_present_mask:

   The CPUs which are present on a system. Present means physically
   present. Physical hotplug sets or clears CPUs in that mask.

   "Physical" hotplug is used in virtualization as well.

3) cpus_online_mask:

   The CPUs which are soft onlined. If a present CPU is not soft onlined
   then it's cleared in the online mask, but still set in the present
   mask.
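
Put differently, the online mask is a subset of the present mask, which is a
subset of the possible mask. A small sketch with made-up masks (the function
name and the state strings are just for illustration):

```python
def cpu_state(cpu, possible, present, online):
    """Classify one CPU according to the three masks."""
    if cpu in online:
        return "online"
    if cpu in present:
        return "present, soft offline"
    if cpu in possible:
        return "possible, not present"
    return "not possible"


possible = set(range(8))
present = {0, 1, 2, 3, 4, 5}
online = {0, 1, 2, 3}

# The masks nest: online within present within possible.
assert online <= present <= possible

for cpu in sorted(possible):
    print(cpu, cpu_state(cpu, possible, present, online))
```

Here CPUs 4 and 5 could be soft plugged at any time, while CPUs 6 and 7
would need a physical (or virtual) hotplug operation first.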

Now back to my suggestion in the other end of this thread, that we should
use cpus_present_mask instead of cpus_online_mask.

The reason why I suggested this is that we have to differentiate between
soft plugging and physical plugging of CPUs.

If CPUs are in the present mask, i.e. physically available, but not in the
online mask, then it's trivial to soft plug them by writing to the
corresponding online file in sysfs. CPU soft plugging is used for power
management nowadays, so the scenario I described in the other mail is not
completely unrealistic.

In case of physical hotplug it's a different story. Neither the kernel nor
user space can plug a CPU physically. It needs interaction by the operator,
i.e. in the real world by inserting/removing hardware or in the
virtualization space by changing the current CPU allocation. So here the
present mask won't help when the number of queues is less than the number of
possible CPUs and an initially not present CPU gets 'physically' plugged
in.

To make things worse we have the unfortunate case of qualiteee BIOS/ACPI
tables which claim that there are more possible CPUs than present CPUs on
systems which cannot support physical hotplug due to lack of hardware
support. Unfortunately there is no simple way to figure out whether a
system supports physical hotplug or not, so we cannot make an informed
decision here. But we can look at the present mask which tells us how many
CPUs are physically available. In a regular boot up the present mask and
the online mask are identical, so there is no difference.

For the physical hotplug case - real or virtual - neither of the spreading
algorithms is ideal. Solving this needs more thought as it would require
recalculating the spreading once the physically plugged CPUs become
available/online.

Hope that clarifies the internals.

Thanks,

tglx