[PATCH 2/2] x86, make check_irq_vectors_for_cpu_disable() aware of numa node irqs

From: Prarit Bhargava
Date: Tue May 13 2014 - 11:40:32 EST


Commit da6139e49c7cb0f4251265cb5243b8d220adb48d ("x86: Add check for
number of available vectors before CPU down") added
check_irq_vectors_for_cpu_disable(), which checks if there are enough empty
vectors to replace the vectors of a downed cpu.

After some additional testing I noticed some odd failures with storage
irqs that I originally thought were likely due to incorrect
CPU_DOWN_FAILED calls in the kernel. They manifest themselves as disk IO
errors in the kernel, and eventually lead to the system filesystems being
remounted read-only.

It turns out that these errors are not issues with CPU_DOWN_FAILED
callbacks; check_irq_vectors_for_cpu_disable() is still insufficient in
the way it checks for available vectors, which leads to the possibility
of "orphaned" irqs (that is, active irqs not assigned to any cpu) after a
CPU down.

The current check in check_irq_vectors_for_cpu_disable() does not take into
account cases where irqs are restricted to specific numa nodes. Currently
the code assumes that an irq can be placed on any cpu; however, if the
PCI device that the irq is assigned to is on a specific node, the irq must
also be moved to another cpu on that node.

This patch uses the irq's node information and affinity mask to determine
where a node-specific irq can transition to. This is done by first
creating a table of cpus and the number of empty vectors each cpu has.
Then the code traverses the list of irqs on the down'd CPU and "allocates"
an irq to an empty vector slot on the cpus. If there are no available
vectors the code returns an error. Note that the actual assignment of the
irqs is still done later in the cpu disable code path.

At the same time I've created a helper function,
_evaluate_irq_for_cpu_disable(), that is only called from
check_irq_vectors_for_cpu_disable(). This makes the code a lot easier to
read.

Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
Cc: x86@xxxxxxxxxx
Cc: Prarit Bhargava <prarit@xxxxxxxxxx>
Cc: Seiji Aguchi <seiji.aguchi@xxxxxxx>
Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
Cc: "K. Y. Srinivasan" <kys@xxxxxxxxxxxxx>
Cc: "Steven Rostedt (Red Hat)" <rostedt@xxxxxxxxxxx>
Cc: Yinghai Lu <yinghai@xxxxxxxxxx>
Signed-off-by: Prarit Bhargava <prarit@xxxxxxxxxx>
Signed-off-by: H. Peter Anvin <hpa@xxxxxxxxxxxxxxx>
---
arch/x86/kernel/irq.c | 139 +++++++++++++++++++++++++++++++++++--------------
1 file changed, 99 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index d03ff8f..7d273be 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -274,67 +274,113 @@ EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq);

#ifdef CONFIG_HOTPLUG_CPU

-/* These two declarations are only used in check_irq_vectors_for_cpu_disable()
+/*
+ * These two declarations are only used in check_irq_vectors_for_cpu_disable()
* below, which is protected by stop_machine(). Putting them on the stack
* results in a stack frame overflow. Dynamically allocating could result in a
* failure so declare these two cpumasks as global.
*/
static struct cpumask affinity_new, online_new;

+/* This array is used to keep track of how many empty vectors each cpu has. */
+static int empty_vectors[NR_CPUS];
+
/*
- * This cpu is going to be removed and its vectors migrated to the remaining
- * online cpus. Check to see if there are enough vectors in the remaining cpus.
- * This function is protected by stop_machine().
+ * Helper function for check_irq_vectors_for_cpu_disable(). Returns 0 if the
+ * irq doesn't need to be reassigned to a new cpu, returns 1 if the irq does
+ * need to be reassigned to a new cpu, and a negative errno if there is no
+ * room for the irq.
*/
-int check_irq_vectors_for_cpu_disable(void)
+static int _evaluate_irq_for_cpu_disable(int irq)
{
- int irq, cpu;
- unsigned int this_cpu, vector, this_count, count;
+ int empty_vectors_set;
+ unsigned int this_cpu;
+ unsigned int vector_cpu;
struct irq_desc *desc;
struct irq_data *data;

- this_cpu = smp_processor_id();
- cpumask_copy(&online_new, cpu_online_mask);
- cpu_clear(this_cpu, online_new);
+ if (irq < 0)
+ return 0;

- this_count = 0;
- for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
- irq = __this_cpu_read(vector_irq[vector]);
- if (irq >= 0) {
- desc = irq_to_desc(irq);
- data = irq_desc_get_irq_data(desc);
- cpumask_copy(&affinity_new, data->affinity);
- cpu_clear(this_cpu, affinity_new);
+ this_cpu = smp_processor_id();
+ desc = irq_to_desc(irq);
+ data = irq_desc_get_irq_data(desc);
+ cpumask_copy(&affinity_new, data->affinity);
+ cpu_clear(this_cpu, affinity_new);

- /* Do not count inactive or per-cpu irqs. */
- if (!irq_has_action(irq) || irqd_is_per_cpu(data))
- continue;
+ /* Do not count inactive or per-cpu irqs. */
+ if (!irq_has_action(irq) || irqd_is_per_cpu(data))
+ return 0;

- /*
- * A single irq may be mapped to multiple
- * cpu's vector_irq[] (for example IOAPIC cluster
- * mode). In this case we have two
- * possibilities:
- *
- * 1) the resulting affinity mask is empty; that is
- * this the down'd cpu is the last cpu in the irq's
- * affinity mask, or
- *
- * 2) the resulting affinity mask is no longer
- * a subset of the online cpus but the affinity
- * mask is not zero; that is the down'd cpu is the
- * last online cpu in a user set affinity mask.
- */
- if (cpumask_empty(&affinity_new) ||
- !cpumask_subset(&affinity_new, &online_new))
- this_count++;
+ if (desc->irq_data.node == NUMA_NO_NODE) {
+ /*
+ * A single irq may be mapped to multiple cpu's vector_irq[]
+ * (for example IOAPIC cluster mode). In this case we have two
+ * possibilities:
+ *
+ * 1) the resulting affinity mask is empty; that is this the
+ * down'd cpu is the last cpu in the irq's affinity mask, or
+ *
+ * 2) the resulting affinity mask is no longer a subset of the
+ * online cpus but the affinity mask is not zero; that is the
+ * down'd cpu is the last online cpu in a user set affinity
+ * mask.
+ */
+ if (cpumask_empty(&affinity_new) ||
+ !cpumask_subset(&affinity_new, &online_new))
+ return 1;
+ } else {
+ /*
+ * In this case, the irq can be mapped to only the CPUs in
+ * the affinity mask. If the mask is empty, then there are
+ * no other available cpus.
+ */
+ if (cpumask_empty(&affinity_new)) {
+			pr_warn("CPU %d disable failed: IRQ %u has no available CPUs to transition to.\n",
+				this_cpu, irq);
+ return -ERANGE;
+ }
+		/* Check to see if there are any available vectors. */
+		empty_vectors_set = 0;
+		for_each_cpu(vector_cpu, &affinity_new) {
+ if (empty_vectors[vector_cpu] > 0) {
+ empty_vectors[vector_cpu]--;
+ empty_vectors_set = 1;
+ break;
+ }
}
+ if (!empty_vectors_set) {
+			pr_warn("CPU %d disable failed: IRQ %u has no available CPUs with available vectors.\n",
+				this_cpu, irq);
+ return -ERANGE;
+ }
+ return 1;
}
+ return 0;
+}
+
+
+/*
+ * This cpu is going to be removed and its vectors migrated to the remaining
+ * online cpus. Check to see if there are enough vectors in the remaining cpus.
+ * This function is protected by stop_machine().
+ */
+int check_irq_vectors_for_cpu_disable(void)
+{
+ int irq, cpu, ret;
+ unsigned int this_cpu, vector, this_count, count, curr_count;
+
+ this_cpu = smp_processor_id();
+ cpumask_copy(&online_new, cpu_online_mask);
+ cpu_clear(this_cpu, online_new);

count = 0;
for_each_online_cpu(cpu) {
- if (cpu == this_cpu)
+ if (cpu == this_cpu) {
+			/* We never assign irqs to the down'd cpu. */
+ empty_vectors[cpu] = -1;
continue;
+ }

/*
* assign_irq_vector() only scan per_cpu vectors from
@@ -345,6 +391,7 @@ int check_irq_vectors_for_cpu_disable(void)
* IRQ_MOVE_CLEANUP_VECTOR (0x20)
* Don't count those as available vectors.
*/
+ curr_count = 0;
for (vector = FIRST_EXTERNAL_VECTOR;
vector < first_system_vector; vector++) {
if (test_bit(vector, used_vectors))
@@ -352,8 +399,20 @@ int check_irq_vectors_for_cpu_disable(void)

if (per_cpu(vector_irq, cpu)[vector] < 0) {
count++;
+ curr_count++;
}
}
+ empty_vectors[cpu] = curr_count;
+ }
+
+ this_count = 0;
+ for (vector = FIRST_EXTERNAL_VECTOR; vector < first_system_vector;
+ vector++) {
+ irq = __this_cpu_read(vector_irq[vector]);
+ ret = _evaluate_irq_for_cpu_disable(irq);
+ if (ret < 0)
+ return ret;
+ this_count += ret;
}

if (count < this_count) {
--
1.7.9.3