Re: False positive "do_IRQ: #.55 No irq handler for vector" messages on AMD ryzen based laptops

From: Lendacky, Thomas
Date: Tue Mar 05 2019 - 14:31:41 EST


On 3/5/19 1:19 PM, Hans de Goede wrote:
> Hi,
>
> On 05-03-19 17:02, Hans de Goede wrote:
>> Hi,
>>
>> On 05-03-19 15:06, Lendacky, Thomas wrote:
>>> On 3/3/19 4:57 AM, Hans de Goede wrote:
>>>> Hi,
>>>>
>>>> On 21-02-19 13:30, Hans de Goede wrote:
>>>>> Hi,
>>>>>
>>>>> On 19-02-19 22:47, Lendacky, Thomas wrote:
>>>>>> On 2/19/19 3:01 PM, Thomas Gleixner wrote:
>>>>>>> Hans,
>>>>>>>
>>>>>>> On Tue, 19 Feb 2019, Hans de Goede wrote:
>>>>>>>
>>>>>>> Cc+: ACPI/AMD folks
>>>>>>>
>>>>>>>> Various people are reporting false positive "do_IRQ: #.55 No irq
>>>>>>>> handler for vector" messages on AMD ryzen based laptops, see e.g.:
>>>>>>>>
>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1551605
>>>>>>>>
>>>>>>>> Which contains this dmesg snippet:
>>>>>>>>
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smp: Bringing up secondary CPUs ...
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: x86: Booting SMP configuration:
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: .... node #0, CPUs:      #1
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 1.55 No irq handler for vector
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel:  #2
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 2.55 No irq handler for vector
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel:  #3
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: do_IRQ: 3.55 No irq handler for vector
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smp: Brought up 1 node, 4 CPUs
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smpboot: Max logical packages: 1
>>>>>>>> Feb 07 20:14:29 localhost.localdomain kernel: smpboot: Total of 4 processors activated (15968.49 BogoMIPS)
>>>>>>>>
>>>>>>>> It seems that we get an IRQ for each CPU as we bring it online,
>>>>>>>> which feels to me like it is some sorta false-positive.
>>>>>>>
>>>>>>> Sigh, that looks like BIOS value add again.
>>>>>>>
>>>>>>> It's not a false positive. Something _IS_ sending a vector 55 to these
>>>>>>> CPUs for whatever reason.
>>>>>>>
>>>>>>
>>>>>> I remember seeing something like this in the past and it turned out
>>>>>> to be a BIOS issue. BIOS was enabling the APs to interact with the
>>>>>> legacy 8259 interrupt controller when only the BSP should. During POST
>>>>>> the APs were exposed to ExtINT/INTR events as a result of the
>>>>>> mis-configuration (probably due to a UEFI timer-tick using the 8259)
>>>>>> and this left a pending ExtINT/INTR interrupt latched on the APs.
>>>>>>
>>>>>> When the APs were started by the OS, the latched ExtINT/INTR
>>>>>> interrupt is processed shortly after the OS enables interrupts. The AP
>>>>>> then queries the 8259 to identify the vector number (which is the
>>>>>> value of the 8259's ICW2 register + the IRQ level). The master 8259's
>>>>>> ICW2 was set to 0x30 and, since no interrupts are actually pending,
>>>>>> the 8259 will respond with IRQ7 (spurious interrupt) yielding a vector
>>>>>> of 0x37 or 55.
>>>>>>
>>>>>> The OS was not expecting vector 55 and printed the message.
>>>>>>
>>>>>> From the Intel Developer's Manual: Vol 3a, Section 10.5.1:
>>>>>> "Only one processor in the system should have an LVT entry
>>>>>> configured to use the ExtINT delivery mode."
>>>>>>
>>>>>> Not saying this is the problem, but very well could be.
>>>>>
>>>>> That sounds like a likely candidate, esp. also since this only happens
>>>>> once per CPU when we first online the CPU.
>>>>>
>>>>> Can you provide me with a patch with some printk-s / pr_debugs to
>>>>> test for this, then I can build a kernel with that patch added and
>>>>> we can see if your hypothesis is right.
>>>>
>>>> Ping? I like your theory, can you provide some help with debugging this
>>>> further (to prove that your theory is correct)?
>>>
>>> It's been a very long time since I dealt with this and I was only on the
>>> periphery. You might be able to print the LVT entries from the APIC and
>>> see if any of them have an un-masked ExtINT delivery mode. You would need
>>> to do this very early before Linux modifies any values.
>>
>> I'm afraid I'm not familiar enough with the interrupt / APIC parts of
>> the kernel to do something like this myself.
>>
>>> Or you can report the issue to the OEM and have them check their BIOS
>>> code to see if they are doing this.
>>
>> I will try to go this route, but I'm not really hopeful that will
>> lead to a solution.
>
> A similar issue is also reported here:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1551605
>
> There are multiple people with different vectors (so likely / possibly
> different bugs) commenting on that bug, but I just got confirmation
> that the vector 55 issue is also happening on an Acer system with an AMD
> A8 processor (I suspect a Ryzen, but that still needs to be confirmed).
>
> So this seems to be a generic issue with (some) AMD laptops and
> not specific to one OEM.

I also see that comment 17 is for an Intel based machine, which to me
implies that it really is a BIOS issue.

Thanks,
Tom

>
> Regards,
>
> Hans