Re: [patch 0/4] timer/nohz: Fix timer/nohz woes

From: Paul E. McKenney
Date: Fri Jan 05 2018 - 14:41:22 EST


On Sat, Dec 23, 2017 at 05:29:24PM -0800, Paul E. McKenney wrote:
> On Sat, Dec 23, 2017 at 05:21:20PM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 22, 2017 at 09:09:07AM -0800, Paul E. McKenney wrote:
> > > On Fri, Dec 22, 2017 at 03:51:11PM +0100, Thomas Gleixner wrote:
> > > > Paul was observing weird stalls which are hard to reproduce and decode. We
> > > > were finally able to reproduce and decode the wreckage on RT.
> > > >
> > > > The following series addresses the issues and hopefully nails the root
> > > > cause completely.
> > > >
> > > > Please review carefully and expose it to the dreaded rcu torture tests
> > > > which seem to be the only way to trigger it.
> > >
> > > Best Christmas present ever, thank you!!!
> > >
> > > Just started up three concurrent 10-hour runs of the infamous rcutorture
> > > TREE01 scenario, and will let you know how it goes!
> >
> > Well, I messed up the first test and then reran it. Which had the benefit
> > of giving me a baseline. The rerun (with all four patches) produced
> > failures, so I ran it again with an additional patch of mine. I score
> > these tests by recording the time at first failure, or, if there is no
> > failure, the duration of the test. Summing the values gives the score.
> > And here are the scores, where 30 is a perfect score:
>
> Sigh. They were five-hour tests, not ten-hour tests.
>
> 1. Baseline: 3.0+2.5+5=10.5
>
> 2. Four patches from Anna-Marie and Thomas: 5+2.7+1.7=9.4
>
> 3. Ditto plus the patch below: 5+4.3+5=14.3
>
> Oh, and the reason for my suspecting that #2 is actually an improvement
> over #1 is that my patch by itself produced a very small improvement in
> reliability. This leads to the hypothesis that #2 really is helping out
> in some way or another.

But after more than 1,000 hours of test runs, split roughly evenly
among the above three scenarios, there is no statistically significant
difference in error rate among them. This means that some other bug
with the same symptom (a lost timer) is still lurking somewhere.
Were you guys ever able to reproduce this via rcutorture?

More details below.

Thanx, Paul

------------------------------------------------------------------------

I ran sets of three-hour runs. I took the time of first error (if
any), and excluded the rest of that particular three-hour run from
consideration. This means that if a given run failed at two hours,
we add one to the "errors" column and two to the "duration" column.
Runs without errors contributed three hours to the "duration" column,
but of course nothing to the "errors" column. An overall errors/hour
rate is then computed for each scenario:

1. Baseline: (378 hours total runtime)
74 errors in 218.8 hours error-free runtime, 0.338 errors/hour.

2. Four patches from Anna-Marie and Thomas: (315 hours total runtime)
65 errors in 195.2 hours error-free runtime, 0.333 errors/hour.

3. Ditto plus the patch below: (315 hours total runtime)
66 errors in 179.4 hours error-free runtime, 0.368 errors/hour.
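
The bookkeeping described above can be sketched in a few lines of Python.
This is only an illustration: the name error_rate, the RUN_HOURS constant,
and the sample run list are made up for the example and are not the actual
test scripts or data.

```python
# Hypothetical sketch of the scoring described above; not the actual scripts.
RUN_HOURS = 3.0  # each run is scheduled for three hours


def error_rate(first_failures):
    """Return (errors, hours, errors_per_hour).

    first_failures holds one entry per run: the time in hours of the
    first error, or None if the run completed without error.
    """
    errors = 0
    hours = 0.0
    for t in first_failures:
        if t is None:
            hours += RUN_HOURS   # clean run: full duration, no error
        else:
            errors += 1          # count only the first error
            hours += t           # discard the rest of this run
    rate = errors / hours if hours else float("nan")
    return errors, hours, rate


# Made-up example: one failure at two hours, two clean runs.
errors, hours, rate = error_rate([2.0, None, None])
print(errors, hours, rate)  # 1 error over 8.0 hours, 0.125 errors/hour
```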

Applying Poisson statistics shows that we need to drop below 0.270
errors/hour to assert that a fix had a 95% chance of having reduced the
error rate, and none of the runs achieve this level of improvement.
In fact, even the least probable scenario had more than a 25% probability
of happening by chance.

These calculations were carried out using maxima:

load(distrib);
bfloat(cdf_poisson(59,218.8*0.338));
(%o11) 4.267467688401431b-2

This is a 4.3% probability of the result having happened due to random
chance, just a bit better than 95% confidence.

bfloat(cdf_poisson(60,218.8*0.338));
(%o8) 5.525461180734715b-2

This is a 5.5% probability of the result having happened due to random
chance, just a bit worse than 95% confidence. So 59 errors is the most
we could see and still claim 95% confidence, and dividing 59 by the
218.8 hours of error-free runs on baseline gives the aforementioned
0.270 errors/hour.
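
For anyone without maxima handy, the same threshold can be cross-checked
with a few lines of Python using only the standard library. This is a
sketch, not what was actually run: poisson_cdf and
max_errors_for_confidence are names invented here, and the expected error
count 218.8*0.338 is taken from the baseline numbers above.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), by direct summation."""
    term = math.exp(-lam)      # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i        # P(X = i) from P(X = i - 1)
        total += term
    return total


def max_errors_for_confidence(lam, alpha=0.05):
    """Largest k with P(X <= k) < alpha, or -1 if there is none."""
    k = -1
    while poisson_cdf(k + 1, lam) < alpha:
        k += 1
    return k


BASELINE_HOURS = 218.8
BASELINE_RATE = 0.338                  # errors/hour
lam = BASELINE_HOURS * BASELINE_RATE   # expected errors at the baseline rate

k = max_errors_for_confidence(lam)
# Prints the error-count threshold, its CDF value, and the matching rate.
print(k, poisson_cdf(k, lam), k / BASELINE_HOURS)
```

The direct summation is fine at this size of expected count; for much
larger values a library routine based on the incomplete gamma function
would be the safer choice.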