Re: smp_call_function_single lockups

From: Linus Torvalds
Date: Fri Feb 20 2015 - 11:49:45 EST

On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> So if my memory serves me right, I think it was for local
> APICs, and even there mostly it was a performance issue: if
> an IO-APIC sent more than 2 IRQs per 'level' to a local
> APIC then the IO-APIC might be forced to resend those IRQs,
> leading to excessive message traffic on the relevant
> hardware bus.

Hmm. I have a distinct memory of interrupts actually being lost, but I
really can't find anything to support that memory, so it's probably
some drug-induced confusion of mine. I don't find *anything* about
interrupt "levels" any more in modern Intel documentation on the APIC,
but maybe I missed something. But it might all have been an IO-APIC
issue anyway.

> I think you got it right.
> So the principle of EOI acknowledgement from the OS to the
> local APIC is specific to the IRQ that raised the interrupt
> and caused the vector to be executed, so it's not possible
> to ack the 'wrong' IRQ.

I agree - but only for cases where we actually should ACK in the first
place. The LAPIC knows what its own priorities were, so it always just
clears the highest-priority ISR bit.

> So I _think_ it's not possible to accidentally acknowledge
> a pending IRQ that has not been issued to the CPU yet
> (unless we have hardirqs enabled), just by writing stray
> EOIs to the local APIC. So in that sense the ExtInt irq0
> case should be mostly harmless.

It must clearly be mostly harmless even if I'm right, since we've
always done the ACK for it, as far as I can tell. So you're probably
right, and it doesn't matter.

In particular, it would depend on exactly when the ISR bit actually
gets set in the APIC. If it gets set only after the CPU has actually
*accepted* the interrupt, then it should not be possible for a
spurious EOI write to screw things up, because it's still happening in
the context of the previously executing interrupt, so a new ISR bit
hasn't been set yet, and any old ISR bits must belong to lower-priority
interrupt routines that were interrupted by the currently executing
higher-priority one.

That sounds like the likely scenario. It just worries me. And my test
patch did actually trigger, even if it only seemed to trigger during
device probing (when the probing itself probably caused and then
cleared an interrupt).

> I also fully share your frustration about the level of
> obfuscation the various APIC drivers display today.
> The lack of a simple single-IPI implementation is annoying
> as well - when that injury was first inflicted with
> clustered APICs I tried to resist, but AFAICR there were
> some good hardware arguments why it cannot be kept and I
> gave up.
> If you agree then I can declare a feature stop for new
> hardware support (that isn't a stop-ship issue for users)
> until it's all cleaned up for real, and Thomas started some
> of that work already.

Well, the attached patch for that seems pretty trivial. And seems to
work for me (my machine also defaults to x2apic clustered mode), and
allows the APIC code to start doing a "send to specific cpu" thing one
by one, since it falls back to the send_IPI_mask() function if no
individual CPU IPI function exists.

NOTE! There's a few cases in arch/x86/kernel/apic/vector.c that also
do that "apic->send_IPI_mask(cpumask_of(i), .." thing, but they aren't
that important, so I didn't bother with them.

NOTE2! I've tested this, and it seems to work, but maybe there is
something seriously wrong. I skipped the "disable interrupts" part
when doing the "send_IPI", for example, because I think it's entirely
unnecessary for that case. But this has certainly *not* gotten any
real stress-testing.

But /proc/interrupts shows thousands of rescheduling IPIs, so this
should all have triggered.

 arch/x86/include/asm/apic.h           |  1 +
 arch/x86/kernel/apic/x2apic_cluster.c |  9 +++++++++
 arch/x86/kernel/smp.c                 | 16 ++++++++++++++--
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 92003f3c8a42..5d47ffcfa65d 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -296,6 +296,7 @@ struct apic {
 					      unsigned int *apicid);
 
 	/* ipi */
+	void (*send_IPI)(int cpu, int vector);
 	void (*send_IPI_mask)(const struct cpumask *mask, int vector);
 	void (*send_IPI_mask_allbutself)(const struct cpumask *mask,
 					 int vector);
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index e658f21681c8..177c4fb027a1 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -23,6 +23,14 @@ static inline u32 x2apic_cluster(int cpu)
 	return per_cpu(x86_cpu_to_logical_apicid, cpu) >> 16;
 }
 
+static void x2apic_send_IPI(int cpu, int vector)
+{
+	u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
+
+	x2apic_wrmsr_fence();
+	__x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
+}
+
 static void
 __x2apic_send_IPI_mask(const struct cpumask *mask, int vector, int apic_dest)
 {
@@ -266,6 +274,7 @@ static struct apic apic_x2apic_cluster = {
 
 	.cpu_mask_to_apicid_and		= x2apic_cpu_mask_to_apicid_and,
 
+	.send_IPI			= x2apic_send_IPI,
 	.send_IPI_mask			= x2apic_send_IPI_mask,
 	.send_IPI_mask_allbutself	= x2apic_send_IPI_mask_allbutself,
 	.send_IPI_allbutself		= x2apic_send_IPI_allbutself,
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index be8e1bde07aa..78c9b12d892a 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -114,6 +114,18 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Helper wrapper: not all apic definitions support sending to
+ * a single CPU, so we fall back to sending to a mask.
+ */
+static void send_IPI_cpu(int cpu, int vector)
+{
+	if (apic->send_IPI)
+		apic->send_IPI(cpu, vector);
+	else
+		apic->send_IPI_mask(cpumask_of(cpu), vector);
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...
@@ -124,12 +136,12 @@ static void native_smp_send_reschedule(int cpu)
 		WARN_ON(1);
 		return;
 	}
-	apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
+	send_IPI_cpu(cpu, RESCHEDULE_VECTOR);
 }
 
 void native_send_call_func_single_ipi(int cpu)
 {
-	apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
+	send_IPI_cpu(cpu, CALL_FUNCTION_SINGLE_VECTOR);
 }
 
 void native_send_call_func_ipi(const struct cpumask *mask)