Re: smp_call_function_single lockups

From: Ingo Molnar
Date: Fri Feb 20 2015 - 04:30:15 EST

* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds
> > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Are there known errata for the x2apic?
> >
> > .. and in particular, do we still have to worry about
> > the traditional local apic "if there are more than two
> > pending interrupts per priority level, things get lost"
> > problem?
> >
> > I forget the exact details. Hopefully somebody
> > remembers.
>
> I can't find it in the docs. I find the "two-entries per
> vector", but not anything that is per priority level
> (group of 16 vectors). Maybe that was the IO-APIC, in
> which case it's immaterial for IPI's.

So if my memory serves me right, I think it was for local
APICs, and even there mostly it was a performance issue: if
an IO-APIC sent more than 2 IRQs per 'level' to a local
APIC then the IO-APIC might be forced to resend those IRQs,
leading to excessive message traffic on the relevant
hardware bus.

( I think the 'resend' was automatic in this case, i.e. a
hardware fallback for a CPU side resource shortage, and
it could not result in actually lost IRQs. I never saw
this documented properly, so people inside Intel or AMD
would be in a better position to comment on this ... I
might be mis-remembering this or confusing different
bugs. )
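
For reference, the 'level' discussed above is the local APIC priority
class, i.e. the upper four bits of the 8-bit vector, which is where the
"group of 16 vectors" comes from. A minimal sketch of that arithmetic
follows; the helper names are made up for illustration and are not
kernel functions:

```c
#include <stdint.h>

/* Priority class: bits 7:4 of the vector; 16 vectors share one class. */
unsigned int apic_priority_class(uint8_t vector)
{
	return vector >> 4;
}

/* Position of the vector inside its class: bits 3:0. */
unsigned int apic_priority_subclass(uint8_t vector)
{
	return vector & 0xf;
}

/* Nonzero if both vectors fall into the same group of 16. */
int apic_same_priority_class(uint8_t a, uint8_t b)
{
	return apic_priority_class(a) == apic_priority_class(b);
}
```

So vectors 0x41 and 0x4f share a class, while 0x4f and 0x50 do not.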

> However, having now mostly re-acquainted myself with the
> APIC details, it strikes me that we do have some oddities
> here.
>
> In particular, a few interrupt types are very special:
> NMI, SMI, INIT, ExtINT, or SIPI are handled early in the
> interrupt acceptance logic, and are sent directly to the
> CPU core, without going through the usual intermediate
> IRR/ISR dance.
>
> And why might this matter? It's important because it
> means that those kinds of interrupts must *not* do the
> apic EOI that ack_APIC_irq() does.
>
> And we correctly don't do ack_APIC_irq() for NMI etc, but
> it strikes me that ExtINT is odd and special.
>
> I think we still use ExtINT for some odd cases. We used
> to have some magic with the legacy timer interrupt, for
> example. And I think they all go through the normal
> "do_IRQ()" logic regardless of whether they are ExtINT or
> not.
>
> Now, what happens if we send an EOI for an ExtINT
> interrupt? It basically ends up being a spurious IPI. And
> I *think* that what normally happens is absolutely
> nothing at all. But if in addition to the ExtINT, there
> was a pending IPI (or other pending ISR bit set), maybe
> we lose interrupts..


I think you got it right.

So in principle the EOI acknowledgement from the OS to the
local APIC is specific to the IRQ that raised the interrupt
and caused the vector to be executed, so it should not be
possible to ack the 'wrong' IRQ.

But technically the EOI is state-less, i.e. (as you know)
we write a constant value to a local APIC register without
indicating which vector or external IRQ we meant. The OS
wants to ack 'the IRQ that we are executing currently', but
this leaves the situation a bit confused in cases where for
example an IRQ handler enables IRQs, another IRQ comes in
and stays unacked.

So I _think_ it's not possible to accidentally acknowledge
a pending IRQ that has not been issued to the CPU yet
(unless we have hardirqs enabled), just by writing stray
EOIs to the local APIC. So in that sense the ExtINT irq0
case should be mostly harmless.

But I could be wrong :-/
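
To make the stateless-EOI argument concrete, here is a toy user-space
model of the IRR/ISR pair. It is only a sketch under simplifying
assumptions: TPR/PPR are ignored, ExtINT/NMI style delivery (which
bypasses IRR/ISR) is not modelled, and all the names (lapic_model,
lapic_raise(), lapic_deliver(), lapic_eoi()) are hypothetical, not
kernel APIs. The one hardware rule it relies on is that an EOI retires
the highest in-service vector:

```c
#include <assert.h>
#include <stdint.h>

#define NR_VECTORS 256

/* Toy model: one pending (IRR) and one in-service (ISR) bit per vector. */
struct lapic_model {
	uint8_t irr[NR_VECTORS / 8];
	uint8_t isr[NR_VECTORS / 8];
};

static void map_set(uint8_t *map, int vector)
{
	assert(vector >= 0 && vector < NR_VECTORS);
	map[vector >> 3] |= (uint8_t)(1u << (vector & 7));
}

static void map_clear(uint8_t *map, int vector)
{
	assert(vector >= 0 && vector < NR_VECTORS);
	map[vector >> 3] &= (uint8_t)~(1u << (vector & 7));
}

/* Highest set vector in the map, or -1 if the map is empty. */
static int map_highest(const uint8_t *map)
{
	for (int v = NR_VECTORS - 1; v >= 0; v--)
		if (map[v >> 3] & (1u << (v & 7)))
			return v;
	return -1;
}

/* An interrupt message arrives: it is latched as pending in the IRR. */
void lapic_raise(struct lapic_model *apic, int vector)
{
	map_set(apic->irr, vector);
}

/*
 * Hand the highest pending vector to the CPU, but only if its priority
 * class (vector >> 4) is above that of the highest in-service vector.
 * Returns the delivered vector, or -1 if nothing can be delivered.
 */
int lapic_deliver(struct lapic_model *apic)
{
	int pending = map_highest(apic->irr);
	int in_service = map_highest(apic->isr);

	if (pending < 0)
		return -1;
	if (in_service >= 0 && (pending >> 4) <= (in_service >> 4))
		return -1;
	map_clear(apic->irr, pending);
	map_set(apic->isr, pending);
	return pending;
}

/*
 * The EOI carries no vector: it retires whatever is the highest
 * in-service vector. Returns that vector, or -1 if the ISR was empty.
 */
int lapic_eoi(struct lapic_model *apic)
{
	int in_service = map_highest(apic->isr);

	if (in_service >= 0)
		map_clear(apic->isr, in_service);
	return in_service;
}
```

In this model a stray lapic_eoi() with an empty ISR does nothing, even
if a vector is still pending in the IRR. A stray lapic_eoi() while some
vector is in service retires that vector early, after which a pending
vector of the same priority class can be delivered on top of the
handler that is still running. Whether real hardware behaves exactly
like this in the ExtINT case is the open question.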


So my suggestion for this bug would be:

The 'does a stray EOI matter' question could also be tested
by deliberately writing two EOIs instead of just one - does
this trigger the bug faster?

Then perhaps try to make sure that hardirqs never get
enabled in an irq handler, and figure out whether any of
the IRQs in question are edge triggered - but AFAICS it
could be 'any' IRQ handler or flow causing the problem.


I also fully share your frustration about the level of
obfuscation the various APIC drivers display today.

The lack of a simple single-IPI implementation is annoying
as well - when that injury was first inflicted with
clustered APICs I tried to resist, but AFAICR there were
some good hardware arguments why it could not be kept and I
gave up.

If you agree then I can declare a feature stop for new
hardware support (that isn't a stop-ship issue for users)
until it's all cleaned up for real, and Thomas started some
of that work already.

> .. and it's entirely possible that I'm just completely
> full of shit. Who is the poor bastard who has worked most
> with things like ExtINT, and can educate me? I'm adding
> Ingo, hpa and Jiang Liu as primary contacts..

So the buck stops at my desk, but any help is welcome!

