RFD: x86: Sanitize the vector allocator

From: Thomas Gleixner
Date: Sun Sep 03 2017 - 15:19:01 EST


The vector allocator of x86 is a pretty stupid linear search algorithm with
a worst case of

nr_vectors * nr_online_cpus * nr_cpus_in_affinity_mask

It has some other magic properties and really wants to be replaced by
something smarter.
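
To make the shape of that worst case concrete, here is a toy userspace
model. It is not the kernel code: toy_find_vector(), the toy_used table and
the tiny constants are made up for illustration, and it only reproduces the
three nested factors of the bound above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS    4   /* hypothetical machine size */
#define TOY_NR_VECTORS 8   /* hypothetical, tiny vector space */

/* toy_used[cpu][vec] is true when vec is already taken on cpu. */
static bool toy_used[TOY_NR_CPUS][TOY_NR_VECTORS];

/*
 * Linear search for a vector which is free on all online CPUs.
 * Returns the vector number or -1. *steps counts the table lookups.
 */
static int toy_find_vector(unsigned int affinity_mask,
                           unsigned int online_mask, size_t *steps)
{
    assert(steps != NULL);
    *steps = 0;

    /* Factor 1: every online CPU in the affinity mask is a candidate. */
    for (int target = 0; target < TOY_NR_CPUS; target++) {
        if (!(affinity_mask & online_mask & (1u << target)))
            continue;

        /* Factor 2: walk the whole vector space linearly. */
        for (int vec = 0; vec < TOY_NR_VECTORS; vec++) {
            bool is_free = true;

            /* Factor 3: the vector must be free on all online CPUs. */
            for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
                if (!(online_mask & (1u << cpu)))
                    continue;
                (*steps)++;
                if (toy_used[cpu][vec]) {
                    is_free = false;
                    break;
                }
            }
            if (is_free)
                return vec;
        }
    }
    return -1;
}
```

With every vector taken on the last CPU only, the search touches
4 * 8 * 4 = 128 table entries before giving up, while an empty table is
resolved after 4 lookups.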

That needs quite some cleanup of the vector management code outside of the
allocator, which I started to work on with the cleanup of the IDT
management which is headed for 4.14. I have some other things in the
pipeline which eliminate quite some duct tape in that area, but I ran into
a couple of interesting things:

1) Multi CPU affinities

This is only available when the APIC is using logical destination
mode. With physical destination mode there is already a restriction to a
single CPU target.

The multi CPU affinity is biased towards the CPU with the lowest APIC ID
in the destination bitfield. Only if that APIC is busy (ISR not empty)
then the next APIC gets it.

A full kernel build on a SKL 4 CPU desktop machine with affinity set to
CPU0-3 shows that more than 90 percent of the AHCI interrupts end up on
CPU0.

Aside from that, the same cold build (right after boot) is about 2% faster
when the AHCI interrupt is only affine to CPU0.

I did some experiments on all my machines which have logical destination
mode with various workloads and the results are similar. The
distribution of interrupts on the CPUs varies with the workloads, but
the vast majority always ends up on CPU0.

I've not found a case where the multi CPU affinity is superior. I might
have the wrong workloads and the wrong machines, but it would be
extremely helpful just to get rid of this and use single CPU affinities
only. That'd simplify the allocator along with the various APIC
implementations.


2) The 'priority level' spreading magic

The comment in __assign_irq_vector says:

* NOTE! The local APIC isn't very good at handling
* multiple interrupts at the same interrupt level.
* As the interrupt level is determined by taking the
* vector number and shifting that right by 4, we
* want to spread these out a bit so that they don't
* all fall in the same interrupt level.
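
In code, the level from that comment and a spreading step look roughly
like this. A sketch only: toy_priority_level() and toy_next_spread() are
hypothetical helpers, the vector range limits are assumptions, and the
stepping rule is just one way to implement "spread these out a bit".

```c
#include <assert.h>

#define TOY_FIRST_VECTOR 0x20u   /* assumed first usable vector */
#define TOY_LAST_VECTOR  0xefu   /* assumed last usable vector */

/* The interrupt level is the vector number shifted right by 4. */
static unsigned int toy_priority_level(unsigned int vector)
{
    assert(vector <= 255);
    return vector >> 4;
}

/*
 * Step to the next candidate in the following priority level. When the
 * end of the range is reached, wrap around to the first level with the
 * low nibble advanced by one.
 */
static unsigned int toy_next_spread(unsigned int vector)
{
    unsigned int next = vector + 16;

    if (next > TOY_LAST_VECTOR)
        next = TOY_FIRST_VECTOR + ((vector + 1) & 0xfu);
    return next;
}
```

Vectors 0x31 and 0x3f share level 3, while stepping from 0x31 by 16 lands
on 0x41 in level 4.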

After doing some palaeontological research I found the following in the
PPro Developer Manual Volume 3:

"7.4.2. Valid Interrupts

The local and I/O APICs support 240 distinct vectors in the range of 16
to 255. Interrupt priority is implied by its vector, according to the
following relationship: priority = vector / 16

One is the lowest priority and 15 is the highest. Vectors 16 through
31 are reserved for exclusive use by the processor. The remaining
vectors are for general use. The processorʼs local APIC includes an
in-service entry and a holding entry for each priority level. To avoid
losing interrupts, software should allocate no more than 2 interrupt
vectors per priority."

The current SDM says nothing about that; instead it states:

"If more than one interrupt is generated with the same vector number,
the local APIC can set the bit for the vector both in the IRR and the
ISR. This means that for the Pentium 4 and Intel Xeon processors, the
IRR and ISR can queue two interrupts for each interrupt vector: one
in the IRR and one in the ISR. Any additional interrupts issued for
the same interrupt vector are collapsed into the single bit in the
IRR.

For the P6 family and Pentium processors, the IRR and ISR registers
can queue no more than two interrupts per interrupt vector and will
reject other interrupts that are received within the same vector."

Which means that on P6/Pentium the APIC will reject a new message and
tell the sender to retry, which increases the load on the APIC bus and
nothing more.

There is no affirmative answer from Intel on that, but I think it's sane
to remove that:

1) I've looked through a bunch of other operating systems and none of
them bothers to implement this or mentions it at all.

2) The current allocator has no enforcement for this and especially the
legacy interrupts, which are the main source of interrupts on these
P6 and older systems, are allocated linearly in the same priority
level and just work.

3) The current machines have no problem with that at all as I verified
with some experiments.

4) AMD at least confirmed that such an issue is unknown.

5) P6 and older are dinosaurs almost 20 years EOL, so we really should
not worry about that anymore.

So this can be eliminated, which makes the allocation mechanism way
simpler.


Some other issues which are not in the way of cleanups and replacements,
but need to be looked at as well:

1) Automated affinity assignment

This only helps when the underlying device requests it and has the
matching queues per CPU. That's what the managed interrupt affinity
mechanism was made for.

In other cases the automated assignment can have really bad effects.

On the same SKL as above I made the AHCI interrupt affine to CPU3 only,
which makes the kernel build slower by a whopping 10% than having it
affine to CPU0. Interestingly enough irqbalanced ends up making the wrong
decision as well.

So we need to be very careful about that. It depends on the device and
the driver how well 'random' placement works.

That means we need hinting from the drivers about their preferred
allocation scheme. If we don't have that then we should for now default
to the current scheme which puts the interrupt on the node on which the
device is.


2) Vector waste

All 16 legacy interrupt vectors are populated at boot and stay there
forever whether they are used or not. On most modern machines that's 10+
vectors wasted for nothing. If the APIC uses logical destination mode
that means these vectors are by default allocated on up to 8 CPUs or in
the case of clustered X2APIC on all CPUs in a cluster.

It'd be worthwhile to allocate these legacy vectors dynamically when
they are actually used. That might fail, but that's the same on devices
which use MSI etc. For legacy systems this is a non issue as there are
plenty of vectors available. On modern machines the 4-5 really used
legacy vectors are requested early during the boot process and should
not end up in a fully exhausted vector space.
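
A minimal sketch of what "allocate when actually used" could look like,
as a userspace toy. Nothing here exists in the kernel: the function, the
table and the constants are hypothetical, and the pool is kept
deliberately tiny so that the failure case is visible.

```c
#include <assert.h>

#define TOY_NR_LEGACY_IRQS  16
#define TOY_NR_FREE_VECTORS 4      /* deliberately small pool */

static int toy_irq_to_vector[TOY_NR_LEGACY_IRQS];  /* 0 means unassigned */
static int toy_vectors_left = TOY_NR_FREE_VECTORS;
static int toy_next_vector = 0x30;                 /* assumed start */

/*
 * Assign a vector to a legacy interrupt on first use. Returns the
 * vector, or -1 when the vector space is exhausted, which is the same
 * failure mode MSI based devices have to handle.
 */
static int toy_request_legacy_irq(int irq)
{
    assert(irq >= 0 && irq < TOY_NR_LEGACY_IRQS);

    if (toy_irq_to_vector[irq])
        return toy_irq_to_vector[irq];   /* already set up */
    if (toy_vectors_left == 0)
        return -1;

    toy_vectors_left--;
    toy_irq_to_vector[irq] = toy_next_vector++;
    return toy_irq_to_vector[irq];
}
```

Only the interrupts which are really requested consume a vector; the
unused legacy lines cost nothing.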

Nothing urgent, but worthwhile to fix I think.

Thoughts?

Thanks,

tglx