Re: Linux 4.9.6 ( Restore IO-APIC irq_chip retrigger callback , breaks my box )

From: Linus Torvalds
Date: Mon Feb 06 2017 - 18:06:57 EST


On Mon, Feb 6, 2017 at 9:30 AM, Gabriel C <nix.or.die@xxxxxxxxx> wrote:
>
> Somewhat late, however I didn't test 4.9.6 but jumped from 4.9.5 to 4.9.7
> and found out my box won't boot anymore.
>
> It hangs early and freezes with a lot of RCU warnings.
> Since I cannot set up a netconsole right now I cannot post the errors,
> really sorry.
>
> ( but I could make a picture if needed )
>
> I bisected it down to :
>
>> Ruslan Ruslichenko (1):
>> x86/ioapic: Restore IO-APIC irq_chip retrigger callback

Ok, it's

020eb3daaba2 ("x86/ioapic: Restore IO-APIC irq_chip retrigger callback")

in mainline.

> Reverting this one fixes the problem for me.

Since that came in rather late, I suspect we'll have to revert for
now. The thing it fixes has been around for almost two years, so it
can't be as serious a problem as the fix itself ended up being.

Thomas?

That said, it also strikes me that the implicated
irq_chip_retrigger_hierarchy() function looks really very suspicious
indeed.

Most of the other users don't seem to traverse the parent all the way
until they find something. They just do the operation in the parent,
and if the parent needs it, it might then do it in _its_ parent and so
on.

And the compiler is able to turn the parent call into a tail call so
it doesn't cause a stack use explosion even if the parenthood chains
end up being pretty deep.

So I'm wondering if that for-loop triggers a stack overflow on your
setup somehow, just because that irq_retrigger() call is now truly
recursive, and hasn't been turned into tail-calls.
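
To make the two shapes concrete, here is a minimal userspace sketch. It
is not the kernel code: struct toy_irq_data and every toy_* name are
made up for illustration, and it says nothing about what actually hangs
on the reported box. It only contrasts "do it in the parent" with "walk
up until something has a hook".

```c
#include <stddef.h>

/* Hypothetical stand-in for one level of an interrupt hierarchy. */
struct toy_irq_data {
	struct toy_irq_data *parent;
	/* Optional per-level retrigger hook; NULL if the level has none. */
	int (*retrigger)(struct toy_irq_data *d);
	int hits;	/* how many times toy_hw_retrigger() ran on this level */
};

/*
 * Shape 1: hand the operation to the immediate parent only. If the
 * parent wants to pass it further up, that is the parent's business.
 * The call is the last thing the function does, so an optimizing
 * compiler may emit it as a tail call (this is not guaranteed).
 */
int toy_retrigger_parent(struct toy_irq_data *d)
{
	d = d->parent;
	if (d && d->retrigger)
		return d->retrigger(d);
	return 0;
}

/*
 * Shape 2: walk up the chain until some ancestor has a hook, then
 * call it. If that hook is itself this function, the walk restarts
 * one level up from inside a nested call.
 */
int toy_retrigger_walk(struct toy_irq_data *d)
{
	for (d = d->parent; d; d = d->parent)
		if (d->retrigger)
			return d->retrigger(d);
	return 0;
}

/* Top-of-chain hook: pretends to resend the interrupt in hardware. */
int toy_hw_retrigger(struct toy_irq_data *d)
{
	d->hits++;
	return 1;
}
```

With a three-level chain (leaf, middle, root) where only the root has
toy_hw_retrigger, shape 2 reaches the root from the leaf even when the
middle level has no hook, while shape 1 stops at the middle level unless
the middle level forwards the call itself.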

But for now, I'd be inclined to just revert it unless somebody has a
"Duh!" moment and can tell me what's wrong with that commit with an
obvious fix.

Comments?

Linus