Re: smp_call_function_single lockups

From: Daniel J Blueman
Date: Sun Feb 22 2015 - 04:00:33 EST

On Saturday, February 21, 2015 at 3:50:05 AM UTC+8, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> > >
> > > So if my memory serves me right, I think it was for
> > > local APICs, and even there mostly it was a performance
> > > issue: if an IO-APIC sent more than 2 IRQs per 'level'
> > > to a local APIC then the IO-APIC might be forced to
> > > resend those IRQs, leading to excessive message traffic
> > > on the relevant hardware bus.
> >
> > Hmm. I have a distinct memory of interrupts actually
> > being lost, but I really can't find anything to support
> > that memory, so it's probably some drug-induced confusion
> > of mine. I don't find *anything* about interrupt "levels"
> > any more in modern Intel documentation on the APIC, but
> > maybe I missed something. But it might all have been an
> > IO-APIC thing.
>
> So I just found an older discussion of it:
> while it's not a comprehensive description, it matches what
> I remember from it: with 3 vectors within a level of 16
> vectors we'd get excessive "retries" sent by the IO-APIC
> through the (then rather slow) APIC bus.
> ( It was possible for the same phenomenon to occur with
> IPIs as well, when a CPU sent an APIC message to another
> CPU, if the affected vectors were equal modulo 16 - but
> this was rare IIRC because most systems were dual CPU so
> only two IPIs could have occurred. )
>
> > Well, the attached patch for that seems pretty trivial.
> > And seems to work for me (my machine also defaults to
> > x2apic clustered mode), and allows the APIC code to start
> > doing a "send to specific cpu" thing one by one, since it
> > falls back to the send_IPI_mask() function if no
> > individual CPU IPI function exists.
> >
> > NOTE! There's a few cases in
> > arch/x86/kernel/apic/vector.c that also do that
> > "apic->send_IPI_mask(cpumask_of(i), .." thing, but they
> > aren't that important, so I didn't bother with them.
> >
> > NOTE2! I've tested this, and it seems to work, but maybe
> > there is something seriously wrong. I skipped the
> > "disable interrupts" part when doing the "send_IPI", for
> > example, because I think it's entirely unnecessary for
> > that case. But this has certainly *not* gotten any real
> > stress-testing.

> I'm not so sure about that aspect: I think disabling IRQs
> might be necessary with some APICs (if lower levels don't
> disable IRQs), to make sure the 'local APIC busy' bit isn't
> set:
> we typically do a wait_icr_idle() call before sending an
> IPI - and if IRQs are not off then the idleness of the APIC
> might be gone. (Because a hardirq that arrives after a
> wait_icr_idle() but before the actual IPI sending sent out
> an IPI and the queue is full.)

The Intel SDM [1] and AMD Fam15h BKDG [2] state that IPIs are queued, so the wait_icr_idle() polling is only necessary on PPro and older, and even there perhaps only to avoid delivery retries. The polling needlessly ties up the IPI caller, so we bypass it in the Numachip APIC driver's IPI-to-self path.
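
To make the cost concrete, here is a minimal user-space sketch of the two styles. It is not the Numachip driver code: struct mock_apic and the mock_* helpers are made-up names, and the "pending" bit is just a counter standing in for the delivery-status bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a local APIC; not a kernel structure. */
struct mock_apic {
    unsigned int busy_reads; /* status reads that still report "send pending" */
    unsigned int spins;      /* polls the caller burned waiting for idle */
    unsigned int sent;       /* number of ICR writes issued */
    uint32_t last_icr;       /* last value written to the mock ICR */
};

/* Models reading the delivery-status bit: pending until busy_reads runs out. */
static bool mock_icr_pending(struct mock_apic *a)
{
    if (a->busy_reads == 0)
        return false;
    a->busy_reads--;
    return true;
}

/* Polled style: spin until the previous IPI was accepted, then write. */
static void mock_send_ipi_polled(struct mock_apic *a, uint32_t icr)
{
    while (mock_icr_pending(a))
        a->spins++;
    a->last_icr = icr;
    a->sent++;
}

/* Unpolled style: write at once and rely on the hardware queueing the IPI. */
static void mock_send_ipi_unpolled(struct mock_apic *a, uint32_t icr)
{
    a->last_icr = icr;
    a->sent++;
}
```

With busy_reads set to 5, the polled variant spins five times before its write while the unpolled one writes immediately; both issue exactly one ICR write.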

On Linus's earlier point: with the large core counts on Numascale systems, I previously implemented a shortcut that lets single IPIs bypass all the cpumask generation and walking. It's way down on my list, but I can try to generalise it and post a patch series at some point, if there's interest.
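
Roughly, the shape of such a shortcut with the fallback Linus describes could look like the user-space sketch below. Everything here is illustrative: struct mock_cpumask, struct mock_apic_ops, the optional send_IPI hook and the counters are assumptions for the example, not the kernel's struct apic or my Numachip code.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_NR_CPUS 64 /* assumption: small enough to fit one 64-bit word */

/* Hypothetical stand-ins; the kernel's struct cpumask and struct apic differ. */
struct mock_cpumask {
    uint64_t bits;
};

struct mock_apic_ops {
    void (*send_IPI_mask)(const struct mock_cpumask *mask, int vector);
    void (*send_IPI)(int cpu, int vector); /* optional per-CPU hook, may be NULL */
};

/* Counters so the cost of each path is visible. */
static unsigned int mock_bits_scanned;
static unsigned int mock_ipis_delivered;

/* Mask path: walk every possible CPU and "send" to each one that is set. */
static void mock_send_IPI_mask(const struct mock_cpumask *mask, int vector)
{
    (void)vector; /* a real driver would encode this into the ICR write */
    for (int cpu = 0; cpu < MOCK_NR_CPUS; cpu++) {
        mock_bits_scanned++;
        if (mask->bits & (UINT64_C(1) << cpu))
            mock_ipis_delivered++;
    }
}

/* Direct path: one destination, no mask built and none walked. */
static void mock_send_IPI(int cpu, int vector)
{
    (void)cpu;
    (void)vector;
    mock_ipis_delivered++;
}

/* Use the per-CPU hook if the driver has one, else fall back to the mask path. */
static void mock_send_ipi_single(const struct mock_apic_ops *ops, int cpu, int vector)
{
    if (ops->send_IPI != NULL) {
        ops->send_IPI(cpu, vector);
    } else {
        struct mock_cpumask one = { .bits = UINT64_C(1) << cpu };
        ops->send_IPI_mask(&one, vector);
    }
}
```

A driver that only fills in send_IPI_mask still works but scans all MOCK_NR_CPUS bits for a single destination; one that also provides send_IPI skips the mask entirely.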


-- [1] Intel SDM 3, p10-30

If more than one interrupt is generated with the same vector number, the local APIC can set the bit for the vector both in the IRR and the ISR. This means that for the Pentium 4 and Intel Xeon processors, the IRR and ISR can queue two interrupts for each interrupt vector: one in the IRR and one in the ISR. Any additional interrupts issued for the same interrupt vector are collapsed into the single bit in the IRR. For the P6 family and Pentium processors, the IRR and ISR registers can queue no more than two interrupts per interrupt vector and will reject other interrupts that are received within the same vector.
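
The Pentium 4/Xeon behaviour in that paragraph can be modelled with one IRR bit and one ISR bit per vector. The sketch below is a deliberately simplified user-space model: it ignores priority classes and nesting, and struct mock_lapic and the mock_* functions are hypothetical names, not kernel or hardware interfaces.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_NR_VECTORS 256

/* Hypothetical model of the per-vector IRR/ISR bits; not a kernel structure. */
struct mock_lapic {
    bool irr[MOCK_NR_VECTORS]; /* interrupt requested, not yet dispatched */
    bool isr[MOCK_NR_VECTORS]; /* interrupt in service, awaiting EOI */
    unsigned int collapsed;    /* requests merged into an already-set IRR bit */
};

/* An interrupt arrives. Returns false if it collapsed into a set IRR bit. */
static bool mock_raise(struct mock_lapic *l, unsigned int vector)
{
    assert(vector < MOCK_NR_VECTORS);
    if (l->irr[vector]) {
        l->collapsed++;
        return false;
    }
    l->irr[vector] = true;
    return true;
}

/* Move a pending vector from IRR to ISR unless the same vector is in service. */
static bool mock_dispatch(struct mock_lapic *l, unsigned int vector)
{
    assert(vector < MOCK_NR_VECTORS);
    if (!l->irr[vector] || l->isr[vector])
        return false;
    l->irr[vector] = false;
    l->isr[vector] = true;
    return true;
}

/* End of interrupt: the handler is done, so clear the in-service bit. */
static void mock_eoi(struct mock_lapic *l, unsigned int vector)
{
    assert(vector < MOCK_NR_VECTORS);
    l->isr[vector] = false;
}
```

Raising the same vector three times with one dispatch after the first leaves one interrupt in the ISR and one in the IRR, and the third is counted in collapsed; after the EOI, the queued one can be dispatched.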

-- [2] AMD Fam15h BKDG p470

DS: interrupt delivery status. Read-only. Reset: 0. In xAPIC mode this bit is set to indicate that the interrupt has not yet been accepted by the destination core(s). 0=Idle. 1=Send pending. Reserved in x2APIC mode. Software may repeatedly write ICRL without polling the DS bit; all requested IPIs will be delivered.