Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context
From: Steven Rostedt
Date: Sat Aug 22 2020 - 19:50:59 EST
On Sat, 22 Aug 2020 14:32:52 +0200
peterz@xxxxxxxxxxxxx wrote:
> On Fri, Aug 21, 2020 at 05:03:34PM -0400, Steven Rostedt wrote:
>
> > > Sigh. Is it too hard to make mutex_trylock() usable from interrupt
> > > context?
> >
> >
> > That's a question for Thomas and Peter Z.
>
> You should really know that too, the TL;DR answer is it's fundamentally
> buggered, can't work.
I knew there was an issue but I couldn't remember the reasoning, and
figured you could easily answer it without having to look back at the
code.
>
> The problem is that RT relies on being able to PI boost the mutex owner.
>
> ISTR we had a thread about all this last year or so, let me see if I can
> find that.
>
> Here goes:
>
> https://lkml.kernel.org/r/20191218135047.GS2844@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
From this email:
> The problem happens when that owner is the idle task, this can happen
> when the irq/softirq hits the idle task, in that case the contending
> mutex_lock() will try and PI boost the idle task, and that is a big
> no-no.
What's wrong with priority boosting the idle task? It's not obvious,
and I can't find comments in the code saying it would be bad.
I looked around the code to see if I could find "why this is bad" but
couldn't find it. There are lots of places that say "Do not use
mutex_trylock in interrupt context, the implementation is not safe to
do so" but I can't find where it says "why" it is not safe to do so.
The idle task is not mentioned at all in rtmutex.c and not mentioned in
kernel/locking except for some comments about RCU in lockdep.
I see that in the idle code the prio_change method does a BUG(), but
there's no comment to say why it does so.
The commit that added that BUG doesn't explain why it can't happen:
a8941d7ec8167 ("sched: Simplify the idle scheduling class")
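To make the sequence from the quoted email concrete, here is a toy userspace sketch of it. This is not kernel code and does not explain why boosting idle is forbidden; it only models the chain of events described above: a trylock taken from interrupt context records the interrupted task as owner, a later contending lock tries to PI boost that owner, and the idle class refuses the priority change. Every name (toy_task, toy_mutex, toy_pi_boost_owner, and so on) and every priority number is hypothetical, and the idle hook returns false at the point where the real idle class does a BUG().

```c
/* Toy userspace model, NOT kernel code. All names and numbers are made up. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task;

/* Per-class hook, loosely modelled on a per-class priority-change callback. */
struct toy_sched_class {
    const char *name;
    /* Returns false where the real idle class would BUG(). */
    bool (*prio_changed)(struct toy_task *t, int old_prio);
};

struct toy_task {
    const char *comm;
    int prio;                         /* lower value = higher priority */
    const struct toy_sched_class *class;
};

struct toy_mutex {
    struct toy_task *owner;           /* NULL when unlocked */
};

static bool normal_prio_changed(struct toy_task *t, int old_prio)
{
    (void)t; (void)old_prio;
    return true;                      /* boostable class: nothing to complain about */
}

static bool idle_prio_changed(struct toy_task *t, int old_prio)
{
    (void)t; (void)old_prio;
    return false;                     /* stands in for the BUG() in the idle class */
}

static const struct toy_sched_class toy_normal_class = { "normal", normal_prio_changed };
static const struct toy_sched_class toy_idle_class   = { "idle",   idle_prio_changed };

/* In interrupt context "curr" is whatever task the interrupt landed on. */
static bool toy_mutex_trylock(struct toy_mutex *m, struct toy_task *curr)
{
    if (m->owner)
        return false;
    m->owner = curr;
    return true;
}

/* Contended lock path: lend the waiter's priority to the current owner. */
static bool toy_pi_boost_owner(struct toy_mutex *m, const struct toy_task *waiter)
{
    struct toy_task *owner = m->owner;
    int old_prio;

    assert(waiter != NULL);
    if (!owner || owner->prio <= waiter->prio)
        return true;                  /* no owner, or no boost needed */

    old_prio = owner->prio;
    owner->prio = waiter->prio;
    return owner->class->prio_changed(owner, old_prio);
}

/* Returns true if the boost is accepted, false if it hits the idle class. */
static bool toy_scenario(bool irq_hit_idle)
{
    struct toy_task idle   = { "swapper",   140, &toy_idle_class };
    struct toy_task worker = { "worker",    120, &toy_normal_class };
    struct toy_task waiter = { "rt-waiter",  10, &toy_normal_class };
    struct toy_mutex m = { NULL };
    struct toy_task *curr = irq_hit_idle ? &idle : &worker;

    if (!toy_mutex_trylock(&m, curr)) /* trylock from "interrupt context" */
        return false;
    return toy_pi_boost_owner(&m, &waiter);
}
```

In this model toy_scenario(false) succeeds because the interrupted task is an ordinary one, while toy_scenario(true) fails because the recorded owner is the idle task. The sketch leaves out everything else the real rtmutex code does (wait lists, PI chains, locking of the task state).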
I may have once known the rationale behind all this, but it's been a
long time since I worked on the PI code, and it's out of my cache.
-- Steve