Re: [PATCH] arch: fix 'unexpected IRQ trap at vector' warnings

From: Enrico Weigelt, metux IT consult
Date: Wed Dec 16 2020 - 11:46:07 EST


On 15.12.20 23:12, Thomas Gleixner wrote:
> On Tue, Dec 15 2020 at 21:12, Enrico Weigelt wrote:
>> On 09.12.20 00:01, Thomas Gleixner wrote:
>>> 3) It's invoked from __handle_domain_irq() when the 'hwirq' which is
>>> handed in by the caller does not resolve to a mapped Linux
>>> interrupt which is pretty much the same as the x86 situation above
>>> in #1, but it prints useless data.
>>>
>>> It prints 'irq' which is invalid but it does not print the really
>>> interesting 'hwirq' which was handed in by the caller and did
>>> not resolve.
>>
>> I wouldn't say the irq-nr isn't interesting. In my particular case it
>> was quite what I've been looking for. But you're right, hwirq should
>> also be printed.
>
> The number is _not_ interesting in this case. It's useless because the
> function does:

Oh, I've mixed up the cases - I only had the other one, down below.

> irq = hwirq;
>
> if (lookup)
> irq = find_mapping(hwirq);
>
> if (!irq || irq >= nr_irqs)
> -> BAD

When exactly can that happen ? Only when some hardware sending an IRQ,
but no driver listening to it, or are there other cases ?

By the way: who's supposed to call that function ? Only irqchip's
(and the few soc specific 1st-level irq handlers) ? I'm asking, because
we have lots of gpio drivers, which have their own irq domain, but go
the generic_handle_irq() route. Same for some SOC-specific irqchips.

Should they also call handle_domain_irq() instead ?

> In both cases the only interesting information is that hwirq does not
> resolve to a valid Linux interrupt number and which hwirq number caused
> that.

Don't we also need know which irqchip the hwirq number belongs to ?

> If you look really then you find out that there is exactly _ONE_
> architecture which does anything else than incrementing a counter and/or
> printing stuff: X86, which has a big fat comment explaining why. The
> only way to ack an interrupt on X86 is to issue EOI on the local APIC,
> i.e. it does _not_ need any further information.

Yeah, found it :)

At this point I wonder whether the ack_APIC_irq() call could be done
somewhere further up in the call chain, eg. handle_irq() or
common_interrupt() ?

If that works, we IMHO could drop ack_bad_irq() completely (except for
the counter and printk, which we could consolidate elsewhere anyways)

>> ... rethinking this further ... shouldn't we also pass in even more data
>> (eg. irq_desc, irqchip, ...), so this function can check which hw to
>> actually talk to ?
>
> There are 3 ways to get there:
>
> 1) via dummy chip which obviously has no hardware associated

... which also calls print_irq_desc() ..

> 2) via handle_bad_irq() which prints the info already

print_irq_desc() doesn't seem to print the hwirq ... shall we fix this ?

> 3) __handle_domain_irq() which cannot print anything and obviously
> cannot figure out the hw to talk to because there is no irq
> descriptor associated.

Okay, what's the conclusion ? Drop printouts in the ack_bad_irq()'s ?

>>> 4) It's invoked from the dummy irq chip which is installed for a
>>> couple of truly virtual interrupts where the invocation of
>>> dummy_irq_chip::irq_ack() is indicating wreckage.
>>>
>>> In that case the Linux irq number is the thing which is printed.
>>>
>>> So no. It's not just inconsistent it's in some places outright
>>> wrong. What we really want is:
>>>
>>> ack_bad_irq(int hwirq, int virq)
>>
>> is 'int' correct here ?
>
> This was just for illustration.

Okay, thanks. Just discovered already have an irq_hw_number_t, which
doesn't seem to be used everywhere ... shall we fix that ?

>> OTOH: since both callers (dummychip.c, handle.c) already dump out before
>> ack_bad_irq(), do we need to print out anything at all ?
>
> Not all callers print something, but yes this could do with some general
> cleanup.

I've found three callers, only one (__handle_domain_irq() in irqdesc.c)
doesn't print out anything. I belive, adding a pr_warn() here and drop
all the printouts in ack_bad_irq()'s makes sense.

> The error counter is independent of that, but yes there is room for
> consolidation.

Ok, I've already started hacking a bit here: adding an atomic_t counter
in kernel/irq/handle.c and inline'd accessor functions in
include/asm-generic/irq.h (just feeling that accessors are a bit cleaner
than direct access). Would that be okay ?

By the way: I still wonder whether my case should have ever reached
ack_bad_irq().

The irqdescs had been allocated via devm_irq_alloc_descs(), and the
driver just called generic_handle_irq() with base irq + gpio nr.
So, IMHO it was a valid linux irq number, but no (explicit) handler.

I wonder whether ack'ing those virtual irqs onto hw could be harmful.


--mtx

--
---
Hinweis: unverschlüsselte E-Mails können leicht abgehört und manipuliert
werden ! Für eine vertrauliche Kommunikation senden Sie bitte ihren
GPG/PGP-Schlüssel zu.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@xxxxxxxxx -- +49-151-27565287