[PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU
From: Chen Yu
Date: Fri Sep 01 2017 - 01:02:47 EST
This is the major logic to spread the vectors on different CPUs.
The main idea is to choose the 'idlest' CPU, i.e. the one with the
fewest vectors assigned, as the candidate/hint for the vector
allocation domain, in the hope that the vector allocation domain
can leverage this hint to generate the corresponding cpumask.
One of the requirements to do this vector spreading work comes from the
following hibernation problem found on a 16-core server:
CPU 31 disable failed: CPU has 62 vectors assigned and there
are only 0 available.
Two issues were found after investigation:
1. The network driver has declared many vector resources via
pci_enable_msix_range(); e.g. this driver may want to reserve
6 vectors per logical CPU, which amounts to 192 vectors in total.
2. Besides, most of the vectors reserved by this driver are assigned
on CPU0 due to the current allocation strategy, so there are
insufficient slots left on CPU0 to receive any migrated IRQs
during CPU offline.
In theory, after commit c5cb83bb337c ("genirq/cpuhotplug: Handle
managed IRQs on CPU hotplug"), issue 1 has been solved by the
managed interrupt mechanism for multi-queue devices, which no longer
migrates these interrupts: it simply shuts them down and restarts
them when the CPU comes back online.
However, due to the way this network driver is implemented,
the commit mentioned above does not fully fix issue 1.
Here's the framework of the network driver:
step 1. Reserve enough IRQ vectors and corresponding IRQs.
step 2. If the network is activated, invoke request_irq() to
register the handler.
step 3. Invoke set_affinity() to spread the IRQs onto different
CPUs, thus spreading the vectors too.
The problem is that if the network cable is not connected, steps 2
and 3 never run, so the IRQ vectors are not spread across
different CPUs and all remain allocated on CPU0. As a result,
CPU offline will still fail.
Previously there was some discussion about vector spreading in the
thread [1], and here are the suggestions from Thomas:
Q1:
Rewrite that allocation code from scratch, use per cpu bitmaps,
so we can do fast search over masks and keep track of
the vectors which are used/free per cpu.
A1:
A per-cpu bitmap was once considered, but it might not be
as fast as the solution proposed here. That is, with a per-cpu
bitmap, we would need to count the number of vectors assigned on
each CPU via cpumask_weight() and sort the CPUs in order to find
the one with the fewest vectors assigned. By contrast, with one
bitmap per vector count, we can get the
'idlest' CPU at a cost of nearly O(1).
One scenario of this patch can be illustrated below:
a) After initialization, vector_mask[0] is set with {CPU0...CPU31},
other vector_mask[i] remain zero, which means that, CPU0...CPU31
have zero vectors assigned.
b) During driver probe, CPU0 is chosen to be assigned with one vector.
Thus vector_mask[0] becomes {CPU1...CPU31}, and vector_mask[1] becomes
{CPU0}.
c) The next time the system tries to assign another vector, it searches from
vector_mask[0]; since that mask is non-empty, CPU1 is chosen to place
the vector. Thus vector_mask[0] becomes {CPU2...CPU31}, and vector_mask[1]
becomes {CPU0, CPU1}.
d) and so on, the vectors assignment will spread on different CPUs.
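The a)-d) walk-through can be modeled in a few lines of user-space C
(a toy sketch, not the kernel code: 32-bit words stand in for
cpumasks, and names such as alloc_mask, pick_idlest_cpu and
assign_vector are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the per-vector-count bitmaps described above.
 * alloc_mask[n] holds the set of CPUs that currently have exactly
 * n vectors assigned; a 32-bit word stands in for a cpumask of
 * 32 CPUs.
 */
#define MAX_VECTORS 8

static uint32_t alloc_mask[MAX_VECTORS + 1];

static void init_masks(void)
{
	for (int n = 0; n <= MAX_VECTORS; n++)
		alloc_mask[n] = 0;
	alloc_mask[0] = 0xffffffffu;	/* all CPUs start with 0 vectors */
}

/* Return the lowest-numbered CPU with the fewest vectors assigned. */
static int pick_idlest_cpu(void)
{
	for (int n = 0; n <= MAX_VECTORS; n++)
		if (alloc_mask[n])
			return __builtin_ctz(alloc_mask[n]);
	return -1;			/* every CPU is full */
}

/* Account one vector on @cpu: move it from alloc_mask[n] to [n+1]. */
static void assign_vector(int cpu)
{
	for (int n = 0; n < MAX_VECTORS; n++) {
		if (alloc_mask[n] & (1u << cpu)) {
			alloc_mask[n] &= ~(1u << cpu);
			alloc_mask[n + 1] |= 1u << cpu;
			return;
		}
	}
}
```

Each lookup scans at most MAX_VECTORS masks regardless of how many
vectors are already assigned, which is the "nearly O(1)" cost
mentioned in A1.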
Q2:
"respect the allocation domains"
A2:
This solution provides a CPU as the param for the vector allocation domains,
in the hope that the domains can leverage this hint to generate a cpumask
covering CPUs with fewer vectors assigned. But yes, the CPU we select upfront
gets fed into apic->vector_allocation_domain(), which might just say no
because the CPU does not belong to it; even so, this should still spread
the vectors somewhat, although I cannot find a more graceful solution to this.
BTW, the vector allocation domain is default_vector_allocation_domain()
in our case, thus it works.
Q3:
"set a preference for the node on which the interrupt was allocated"
A3:
This can be achieved by further optimization in the following way:
a) Also record the number of vectors used per node.
b) Each time __assign_irq_vector() is invoked, adjust the input mask
by cpumask_and(mask, mask, node_mask_with_least_vectors())
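Steps a) and b) can be sketched in user-space C as well (again a toy
model under an assumed two-node topology; vectors_per_node,
node_mask_with_least_vectors and adjust_mask are illustrative names,
and a 32-bit AND stands in for cpumask_and()):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the node preference: vectors_per_node[] is the
 * per-node counter from step a), and adjust_mask() applies the
 * cpumask_and() from step b).  Assumed topology: CPUs 0-15 on
 * node 0, CPUs 16-31 on node 1.
 */
#define NNODES 2

static unsigned int vectors_per_node[NNODES];
static const uint32_t node_cpus[NNODES] = { 0x0000ffffu, 0xffff0000u };

static uint32_t node_mask_with_least_vectors(void)
{
	int best = 0;

	for (int n = 1; n < NNODES; n++)
		if (vectors_per_node[n] < vectors_per_node[best])
			best = n;
	return node_cpus[best];
}

/* Step b): restrict the requested mask to the least-loaded node. */
static uint32_t adjust_mask(uint32_t mask)
{
	uint32_t m = mask & node_mask_with_least_vectors();

	return m ? m : mask;	/* fall back if the node has no overlap */
}
```

The fallback keeps the allocation working when the caller's mask has
no CPU on the preferred node, so the optimization only narrows the
search, never breaks it.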
[1] https://marc.info/?l=linux-kernel&m=149867663002202
or
[1] https://patchwork.kernel.org/patch/9725227/
Thanks for your time.
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
Cc: "Rafael J. Wysocki" <rjw@xxxxxxxxxxxxx>
Cc: Len Brown <lenb@xxxxxxxxxx>
Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
---
arch/x86/kernel/apic/vector.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index b60cc66..c7f0e8b 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -145,6 +145,28 @@ static void free_apic_chip_data(struct apic_chip_data *data)
}
}
+static int pick_next_cpu_hint(const struct cpumask *mask)
+{
+ int i;
+ struct cpumask search, tmp;
+
+ cpumask_and(&search, mask, cpu_online_mask);
+ /*
+ * Search in the order from vectors_alloc_mask[0] to
+ * vectors_alloc_mask[255], try to find an intersection
+ * between the mask and vectors_alloc_mask[],
+ * return the first CPU in this intersection, thus
+ * this CPU has the least number of vectors allocated.
+ */
+ for (i = 0; i < NR_VECTORS; i++) {
+ cpumask_and(&tmp, &search, vectors_alloc_mask[i]);
+ if (!cpumask_empty(&tmp))
+ return cpumask_first(&tmp);
+ }
+	/* This should not happen... */
+ return cpumask_first_and(mask, cpu_online_mask);
+}
+
static int __assign_irq_vector(int irq, struct apic_chip_data *d,
const struct cpumask *mask,
struct irq_data *irqdata)
@@ -175,7 +197,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
/* Only try and allocate irqs on cpus that are present */
cpumask_clear(d->old_domain);
cpumask_clear(searched_cpumask);
- cpu = cpumask_first_and(mask, cpu_online_mask);
+ cpu = pick_next_cpu_hint(mask);
while (cpu < nr_cpu_ids) {
int new_cpu, offset;
@@ -247,7 +269,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
*/
cpumask_or(searched_cpumask, searched_cpumask, vector_cpumask);
cpumask_andnot(vector_cpumask, mask, searched_cpumask);
- cpu = cpumask_first_and(vector_cpumask, cpu_online_mask);
+ cpu = pick_next_cpu_hint(vector_cpumask);
continue;
}
return -ENOSPC;
--
2.7.4