Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

From: Paul E. McKenney
Date: Wed Feb 17 2016 - 18:19:56 EST


On Wed, Feb 17, 2016 at 12:28:29PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 17, 2016 at 08:45:54PM +0100, Peter Zijlstra wrote:
> > On Wed, Feb 17, 2016 at 11:28:17AM -0800, Paul E. McKenney wrote:
> > > On Tue, Feb 16, 2016 at 09:45:49PM -0800, Paul E. McKenney wrote:
> > > > On Tue, Feb 09, 2016 at 09:11:55PM +1100, Ross Green wrote:
> > > > > Continued testing with the latest linux-4.5-rc3 release.
> > > > >
> > > > > Please find attached a copy of traces from dmesg:
> > > > >
> > > > > There is a lot more debug and trace data so hopefully this will shed
> > > > > some light on what might be happening here.
> > > > >
> > > > > My testing remains run a series of simple benchmarks, let that run to
> > > > > completion and then leave the system idle away with just a few daemons
> > > > > running.
> > > > >
> > > > > the self detected stalls in this instance turned up after a days run time.
> > > > > There were NO heavy artificial computational loads on the machine.
> > > >
> > > > It does indeed look quiet on that dmesg for a good long time.
> > > >
> > > > The following insanely crude not-for-mainline hack -might- be producing
> > > > good results in my testing. It will take some time before I can claim
> > > > statistically different results. But please feel free to give it a go
> > > > in the meantime. (Thanks to Al Viro for pointing me in this direction.)
> >
> > Your case was special in that is was hotplug triggering it, right?
>
> Yes, it has thus far only shown up with CPU hotplug enabled.
>
> > I was auditing the hotplug paths involved when I fell ill two weeks ago,
> > and have not really made any progress on that because of that :/
>
> I have always said that being sick is bad for one's health, but I didn't
> realize that it could be bad for the kernel's health as well. ;-)
>
> > I'll go have another look, I had a vague feeling for a race back then,
> > lets see if I can still remember how..
>
> I believe that I can -finally- get an ftrace_dump() to happen within
> 10-20 milliseconds of the problem, which just might be soon enough
> after the problem to gather some useful information. I am currently
> testing this theory with "ftrace trace_event=sched_waking,sched_wakeup"
> boot arguments on a two-hour run.

And apparently another way to greatly reduce the probability of this
bug occurring is to enable ftrace. :-/

Will try longer runs.

Thanx, Paul

> If this works out, what would be a useful set of trace events for me
> to capture?
>
> Thanx, Paul