Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17
From: Paul E. McKenney
Date: Sun Apr 03 2016 - 04:19:17 EST
On Thu, Mar 31, 2016 at 08:42:55AM -0700, Paul E. McKenney wrote:
> On Wed, Mar 30, 2016 at 07:55:47AM -0700, Paul E. McKenney wrote:
> > On Tue, Mar 29, 2016 at 06:49:08AM -0700, Paul E. McKenney wrote:
> > > On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote:
> > > > On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
> > > > > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:
> > > > >
> > > > > [ . . . ]
> > > > >
> > > > > > > > OK, so I should instrument migration_call() if I get the repro rate up?
> > > > > > >
> > > > > > > Can do, maybe try the below first. (yes I know how long it all takes :/)
> > > > > >
> > > > > > OK, will run this today, then run calibration for last night's run this
> > > > > > evening.
> > >
> > > And of 18 two-hour runs, there were five failures, or about 28%.
> > > That said, I don't have even one significant digit on the failure rate,
> > > as 5 of 18 is within the 95% confidence limits for a failure probability
> > > as low as 12.5% and as high as 47%.
> >
> > And after last night's run, this is narrowed down to between 23% and 38%,
> > which is close enough. Average is 30%, 18 failures in 60 runs.
> >
> > Next step is to test Peter's patch some more. Might take a couple of
> > night's worth of runs to get statistical significance. After which
> > it will be time to rebase to 4.6-rc1.
>
> And the first night was not so good: 6 failures out of 24 runs. Adding
> this to the 1-of-10 earlier gets 7 failures of 34. Here are how things
> stack up given the range of base failure estimates:
>
> Low 95% bound of 23%: 84% confidence.
>
> Actual measurement of 30%: 92% confidence.
>
> High 95% bound of 38%: 98% confidence.
>
> So there is still some chance that Peter's patch is helping. I will
> run for one more evening, after which it will be time to move forward
> to 4.6-rc1.
And no luck reducing bounds. However, moving to 4.6-rc1 did get some
of the trace_printk() to print. The ftrace_dump()s resulted in RCU
CPU stall warnings, and the dumps were truncated due to test timeouts
in my scripting. (I need to make my scripts more patient when they
see an ftrace dump in progress, I guess.)
Here are the results:
http://www2.rdrop.com/users/paulmck/submission/TREE03.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.1.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.2.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.3.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.4.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.5.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.6.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.7.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.8.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.9.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.11.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.12.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.13.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.14.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.15.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.16.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.17.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.18.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.19.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.20.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.21.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.22.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.23.console.log.tgz
http://www2.rdrop.com/users/paulmck/submission/TREE03.24.console.log.tgz
The config is here:
http://www2.rdrop.com/users/paulmck/submission/config.tgz
More runs to measure 4.6-rc1 base error rate...
Thanx, Paul