Re: APIC error on SMP machine

From: James Cleverdon
Date: Tue Sep 30 2003 - 20:53:46 EST


On Tuesday 30 September 2003 2:42 pm, Chris Rankin wrote:
> Linux-2.4.22-SMP, 1 GB RAM, devfs, gcc-3.2.3.
>
> Hi,
>
> Today, my dual PIII (Coppermine) refused to boot, and wrote a large number
> of these messages to the serial console instead:
>
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
> APIC error on CPU1: 04(04)
>
> Can anyone tell me what these might mean, please? The kernel source implies
> that it's a "Send accept error", but this doesn't help me in an "Ah, I can
> fix that!" sense.
>
> Does this APIC error just mean that the CPU is unhappy in this slot, and is
> refusing to listen to the motherboard? Or is the motherboard refusing to
> listen to the CPU?

Neither. An APIC send accept error means that an interrupt was sent but the
target did not accept it. In this case, the target is a CPU, either your
other CPU or the same one (a CPU can send itself an interrupt).
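
As a rough illustration of how the "04" in those messages maps to "Send
accept error": the value printed is the local APIC Error Status Register,
and each bit names one error. The sketch below decodes such a value using
the bit layout documented in the kernel source comment. The function names
esr_bit_name() and esr_describe() are made up for this example and are not
kernel APIs.

```c
#include <assert.h>
#include <stdio.h>

/* Names of the local APIC Error Status Register bits, lowest bit first. */
static const char *const esr_bit_names[8] = {
    "Send checksum error",
    "Receive checksum error",
    "Send accept error",
    "Receive accept error",
    "Reserved",
    "Send illegal vector",
    "Received illegal vector",
    "Illegal register address",
};

/* Hypothetical helper: return the name of a single ESR bit (0..7). */
const char *esr_bit_name(unsigned bit)
{
    assert(bit < 8);
    return esr_bit_names[bit];
}

/* Hypothetical helper: print every error bit set in the low byte of an
 * ESR value and return how many bits were set. */
int esr_describe(unsigned esr)
{
    int count = 0;

    for (unsigned bit = 0; bit < 8; bit++) {
        if (esr & (1u << bit)) {
            printf("bit %u: %s\n", bit, esr_bit_name(bit));
            count++;
        }
    }
    return count;
}
```

Calling esr_describe(0x04) reports only bit 2, which is the send accept
error discussed here.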

While there are several reasons why this can happen, the most common ones are:

1) The target CPU is "full". The local APIC on P54Cs through P3s has only two
interrupt latches per interrupt "level", where the level is the high nibble
of the IRQ vector number. So, if a CPU had already latched interrupt vectors
0x30 and 0x3A, it would have to reject any other 0x3X vector that was sent
until it could service one of the two latched vectors.

You can force this to happen by manually binding too many IRQs that happen to
be on the same "level" to one CPU, then causing a lot of interrupt traffic on
those devices.

In order to avoid this problem, Linux spreads the IRQs among as many vector
levels as possible. Still, the vector assignment is done before any devices
have requested interrupts, so you may get unlucky and end up with 3 devices
on one level.

2) The interrupt cannot be delivered because something is wrong with it. This
can happen if the kernel screws up and picks "clustered" APIC mode on a
"flat" system or vice versa. A dual P3 system should be flat. Check your
dmesg log to make sure it was properly detected. (This seldom happens unless
you're doing interrupt development work in Linux.)

3) Maybe the other CPU is broken and physically cannot accept the interrupt.
Do any previous kernels boot?
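
To make case 1 concrete, here is a toy model of the "two latches per level"
behaviour described above. It is only a sketch of that description, not of
the real hardware logic, and every name in it (struct toy_apic,
toy_apic_accept(), toy_apic_service()) is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define APIC_LEVELS        16  /* one level per high nibble of the vector */
#define LATCHES_PER_LEVEL   2  /* as described in case 1 above */

/* Toy model: count of latched (accepted, not yet serviced) vectors
 * for each level. */
struct toy_apic {
    uint8_t latched[APIC_LEVELS];
};

/* The "level" of a vector is its high nibble: 0x3A -> level 3. */
static unsigned vector_level(uint8_t vector)
{
    return vector >> 4;
}

/* Try to latch a vector. Returns false when both latches of that level
 * are in use, which is the situation that leads to a send accept error
 * on the sender. */
bool toy_apic_accept(struct toy_apic *apic, uint8_t vector)
{
    assert(apic != NULL);
    unsigned level = vector_level(vector);

    if (apic->latched[level] >= LATCHES_PER_LEVEL)
        return false;
    apic->latched[level]++;
    return true;
}

/* Servicing a latched vector frees one latch on its level. */
void toy_apic_service(struct toy_apic *apic, uint8_t vector)
{
    assert(apic != NULL);
    unsigned level = vector_level(vector);

    if (apic->latched[level] > 0)
        apic->latched[level]--;
}
```

With 0x30 and 0x3A already latched, a third 0x3X vector is rejected in this
model while a 0x4X vector is still accepted; once one of the 0x3X vectors is
serviced, the next 0x3X vector gets through again.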

> Background:
> This machine has been misbehaving for a while. I thought I had worked
> around the problem by underclocking the FSB from 133 MHz to 100 MHz, but
> that now looks like it was just a "reprieve". I have tried running "nosmp",
> "pci=noacpi" and "noapic pci=noacpi" without success, and have resorted to
> yanking the CPU out of this slot entirely. (I suspect that the CPU is fine,
> however.) I have also restored the FSB to 133 MHz, so I am currently
> running the SMP kernel on a single 933 MHz PIII.
>
> Cheers,
> Chris
>
> -


--
James Cleverdon
IBM xSeries Linux Solutions
{jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot comm