Re: RCU stall when using function_graph
From: Paul E. McKenney
Date: Wed Aug 16 2017 - 12:32:40 EST

On Wed, Aug 16, 2017 at 10:04:21AM -0400, Steven Rostedt wrote:
> On Wed, 16 Aug 2017 10:42:15 +0200
> Daniel Lezcano <daniel.lezcano@xxxxxxxxxx> wrote:
>
> > Hi Steven,
> >
> >
> > On 15/08/2017 15:29, Steven Rostedt wrote:
> > >
> > > [ I'm back from vacation! ]
> >
> > Did you get the tapes? :)
>
> Yes, but nothing in them would cause the reputation of the POTUS to
> become any worse than it already is.
>
> >
> > > On Wed, 9 Aug 2017 17:51:33 +0200
> > > Daniel Lezcano <daniel.lezcano@xxxxxxxxxx> wrote:
> > >
> > >> Well, maybe the instruction pointer thing is not a good idea.
> > >>
> > >> I learnt from this experience that an overloaded kernel with a lot of
> > >> interrupts can hang the console and trigger an RCU stall.
> > >>
> > >> However, someone else can face the same situation. Even if he reads the
> > >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> > >>
> > >> A message saying the grace period can't be reached because we are too
> > >> busy processing interrupts would have helped, but I understand it is not
> > >> easy to implement.
> > >
> > > What if the stall code triggered an irqwork first? The irqwork would
> > > trigger as soon as interrupts were enabled again (or at the next tick,
> > > depending on the arch), and then it would know that RCU stalled due to
> > > an irq storm if the irqwork is being hit.
> >
> > Is that condition enough to tell that the CPU is overutilized by
> > interrupt handling?
> >
> > And I'm wondering if it wouldn't make sense to have this detection in
> > the irq code. With or without the RCU stall warning kernel option set,
> > the irq framework would warn about this situation. If the RCU stall
> > option is set, that will issue a second message. It will be easy to make
> > the connection between the first message and the second one, no?
>
> The thing is, the RCU code keeps track of the state of progress; I
> don't believe the interrupt code does. It just worries about handling
> interrupts. I'm not excited about adding infrastructure to the
> interrupt code to do accounting of IRQ storms.
>
> On the other hand, the RCU code already does this. If it notices a
> stall, it can trigger an irq_work and wait a little more. If the
> irq_work doesn't fire, then it can do the normal RCU stall message. But
> if the irq_work does fire, and the RCU progress still hasn't moved
> forward, then it would be able to say this is due to an IRQ storm and
> produce a better error message.

Let me see if I understand you... About halfway to the stall limit,
RCU triggers an irq_work (on each CPU that has not yet passed through
a quiescent state, IPIing them in turn?), and if the irq_work has
not completed by the end of the stall limit, RCU adds that to its
stall-warning message.

Or am I missing something here?
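To pin down what I am asking, here is a rough userspace C model of that
sequence, not kernel code. All of the names (cpu_model,
queue_probes_at_halfway(), probe_handler(), classify_at_limit()) are made
up for illustration. In the kernel the queuing step would presumably go
through something like irq_work_queue_on(), and the verdict labels reflect
my reading of your proposal rather than anything RCU does today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU bookkeeping; these names do not exist in the kernel. */
struct cpu_model {
	bool passed_qs;    /* CPU has passed through a quiescent state */
	bool probe_queued; /* irq_work-style probe was queued at the halfway point */
	bool probe_ran;    /* the probe's handler actually executed */
};

enum stall_verdict {
	STALL_NONE,          /* quiescent state seen: nothing to report */
	STALL_PLAIN,         /* stalled, but no probe was ever queued */
	STALL_PROBE_PENDING, /* probe queued but never ran before the limit */
	STALL_PROBE_RAN,     /* probe ran, still no quiescent state (IRQ storm?) */
};

/* About halfway to the stall limit: probe every CPU still holding things up. */
size_t queue_probes_at_halfway(struct cpu_model *cpus, size_t n)
{
	size_t queued = 0;

	for (size_t i = 0; i < n; i++) {
		if (cpus[i].passed_qs)
			continue;
		cpus[i].probe_queued = true;
		queued++;
	}
	return queued;
}

/* Stand-in for the irq_work handler running on the target CPU. */
void probe_handler(struct cpu_model *cpu)
{
	assert(cpu->probe_queued); /* a probe can only run if it was queued */
	cpu->probe_ran = true;
}

/* At the stall limit: decide what to add to the stall-warning message. */
enum stall_verdict classify_at_limit(const struct cpu_model *cpu)
{
	if (cpu->passed_qs)
		return STALL_NONE;
	if (!cpu->probe_queued)
		return STALL_PLAIN;
	return cpu->probe_ran ? STALL_PROBE_RAN : STALL_PROBE_PENDING;
}
```

So a CPU that took the probe but still has not reported a quiescent state
would be flagged as a possible IRQ storm, while one where the probe never
ran would get that fact added to the usual stall warning.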
Thanx, Paul