Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

From: Ross Green
Date: Mon May 09 2016 - 22:36:46 EST


On Sun, May 8, 2016 at 1:25 AM, Paul E. McKenney
<paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> On Fri, May 06, 2016 at 04:25:16PM +1000, Ross Green wrote:
>> On Sun, Apr 3, 2016 at 6:18 PM, Paul E. McKenney
>> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>> > On Thu, Mar 31, 2016 at 08:42:55AM -0700, Paul E. McKenney wrote:
>> >> On Wed, Mar 30, 2016 at 07:55:47AM -0700, Paul E. McKenney wrote:
>> >> > On Tue, Mar 29, 2016 at 06:49:08AM -0700, Paul E. McKenney wrote:
>> >> > > On Mon, Mar 28, 2016 at 05:28:14PM -0700, Paul E. McKenney wrote:
>> >> > > > On Mon, Mar 28, 2016 at 05:25:18PM -0700, Paul E. McKenney wrote:
>> >> > > > > On Mon, Mar 28, 2016 at 06:08:41AM -0700, Paul E. McKenney wrote:
>> >> > > > > > On Mon, Mar 28, 2016 at 08:25:47AM +0200, Peter Zijlstra wrote:
>> >> > > > > > > On Sun, Mar 27, 2016 at 02:06:41PM -0700, Paul E. McKenney wrote:
>> >> > > > >
>> >> > > > > [ . . . ]
>> >> > > > >
>> >> > > > > > > > OK, so I should instrument migration_call() if I get the repro rate up?
>> >> > > > > > >
>> >> > > > > > > Can do, maybe try the below first. (yes I know how long it all takes :/)
>> >> > > > > >
>> >> > > > > > OK, will run this today, then run calibration for last night's run this
>> >> > > > > > evening.
>> >> > >
>> >> > > And of 18 two-hour runs, there were five failures, or about 28%.
>> >> > > That said, I don't have even one significant digit on the failure rate,
>> >> > > as 5 of 18 is within the 95% confidence limits for a failure probability
>> >> > > as low as 12.5% and as high as 47%.
>> >> >
>> >> > And after last night's run, this is narrowed down to between 23% and 38%,
>> >> > which is close enough. Average is 30%, 18 failures in 60 runs.
>> >> >
>> >> > Next step is to test Peter's patch some more. Might take a couple of
>> >> > night's worth of runs to get statistical significance. After which
>> >> > it will be time to rebase to 4.6-rc1.
>> >>
>> >> And the first night was not so good: 6 failures out of 24 runs. Adding
>> >> this to the 1-of-10 earlier gets 7 failures of 34. Here are how things
>> >> stack up given the range of base failure estimates:
>> >>
>> >> Low 95% bound of 23%: 84% confidence.
>> >>
>> >> Actual measurement of 30%: 92% confidence.
>> >>
>> >> High 95% bound of 38%: 98% confidence.
>> >>
>> >> So there is still some chance that Peter's patch is helping. I will
>> >> run for one more evening, after which it will be time to move forward
>> >> to 4.6-rc1.
>> >
>> > And no luck reducing bounds. However, moving to 4.6-rc1 did get some
>> > of the trace_printk() to print. The ftrace_dump()s resulted in RCU
>> > CPU stall warnings, and the dumps were truncated due to test timeouts
>> > in my scripting. (I need to make my scripts more patient when they
>> > see an ftrace dump in progress, I guess.)
>> >
>> > Here are the results:
>> >
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.1.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.2.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.3.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.4.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.5.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.6.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.7.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.8.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.9.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.11.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.12.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.13.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.14.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.15.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.16.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.17.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.18.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.19.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.20.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.21.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.22.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.23.console.log.tgz
>> > http://www2.rdrop.com/users/paulmck/submission/TREE03.24.console.log.tgz
>> >
>> > The config is here:
>> >
>> > http://www2.rdrop.com/users/paulmck/submission/config.tgz
>> >
>> > More runs to measure 4.6-rc1 base error rate...
>> >
>> > Thanx, Paul
>> >
>> G'day Paul,
>>
>>
>> Thought i would update a few runs with the linux-4.6-rc kernels.
>>
>> I have attached log outputs through dmesg showing rcu_preempt stall warnings.
>>
>>
>> Thought it might be interesting for someone else to look at.
>>
>> Currently running linux-4.6-rc6 in testing.
>
> Thank you for sending these, I will look them over!
>
> Still working to reproduce this quickly enough to do real debug... :-/
>
> Thanx, Paul
>


I have noticed in the latest 4.6rc series of kernels there have been a
couple that I have run for over a week and not pulled any stall
warnings. Might be just a coincidence!

I typically run each rc release to the next release. So its around a 7
day stint of exercising each kernel. The method remains the same, run
series of short bench marks to see everything is behaving and then let
lie idle. The system is never stressed after the initial benchmarking.
I did notice that with one of the recent kernels there was a missing
hr_timer event. Unfortunately, due to my own tardiness I missed
collecting that log!!!! It also had the stall warning.

Anyway currently running linux-4.6-rc7, so far no stall warnings.

Every time I have tired running huge CPU workloads have not been able
to induce the stall warnings.

What ever is going on here certainly seems obscure!

Regards,

Ross Green