Re: [PATCH v3] rcu: Add a minimum time for marking boot as completed

From: Paul E. McKenney
Date: Thu Mar 09 2023 - 16:53:50 EST


On Thu, Mar 09, 2023 at 03:17:09PM +0000, Zhuo, Qiuxu wrote:
> > From: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > [...]
> > >
> > > a's standard deviation is ~0.4.
> > > b's standard deviation is ~0.5.
> > >
> > > a's average of 9.0 is at the upper bound of b's one-standard-deviation
> > > interval [8.0, 9.0].
> > > So, the measurements should be statistically significant to some degree.
> >
> > That single standard deviation means that you have 68% confidence that the
> > difference is real. This is not far above the 50% level of random noise.
> > 95% is the lowest level that is normally considered to be statistically
> > significant.
>
> 95% roughly means there is no overlap between the two-standard-deviation
> intervals of a and b.
>
> This relies on either much less noise during testing or a big enough
> difference between a and b.
>
> > > The calculated standard deviations are via:
> > > https://www.gigacalculator.com/calculators/standard-deviation-calculator.php
> >
> > Fair enough. Formulas are readily available as well, and most spreadsheets
> > support standard deviation.
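
For example, a few lines of Python do the same job. (This is just an
illustration; the sample values below are made up, not your
measurements.)

    import statistics

    # Hypothetical boot times in seconds, not the real data.
    a = [8.9, 9.4, 8.6, 9.1, 9.0]
    b = [8.2, 8.5, 8.0, 8.4, 8.9]

    # stdev() is the sample standard deviation (n - 1 in the
    # denominator); pstdev() would divide by n instead.
    print("a: mean %.2f stdev %.2f" % (statistics.mean(a), statistics.stdev(a)))
    print("b: mean %.2f stdev %.2f" % (statistics.mean(b), statistics.stdev(b)))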
> >
> > [...]
> >
> > > > Why don't you try applying this approach to the new data? You will
> > > > need the general binomial formula.
> > >
> > > Thank you Paul for the suggestion.
> > > I just tried it, but not sure whether my analysis was correct ...
> > >
> > > Analysis 1:
> > > a's median is 8.9.
> >
> > I get 8.95, which is the average of the 24th and 25th members of a in
> > numerical order.
>
> Yes, it should be 8.95. Thanks for correcting me.
>
> > > 35 of b's 48 data points are more than 0.1 below a's median.
> > > For a binomial distribution with p=0.5, P(X >= 35) = 0.1%.
> > > So, we have strong confidence that b is 100ms faster than a.
> >
> > I of course get quite a bit stronger confidence, but your 99.9% is good
> > enough. And I get even stronger confidence going in the other direction.
> > However, the fact that a's median varies from 8.7 in the old experiment to
> > 8.95 in this experiment does give some pause. These are after all supposedly
> > drawn from the same distribution. Or did you use a different machine or
> > different OS version or some such in the two sets of measurements?
> > Different time of day and thus different ambient temperature, thus different
> > CPU clock frequency?
>
> All the testing setups were identical except for the testing time.
>
> Old a median : 8.7
> New a median : 8.95
>
> Old b median : 8.2
> New b median : 8.45
>
> I'm a bit surprised that both new medians are exactly 0.25 greater than
> the old medians. Coincidence?

Possibly some semi-rare race condition makes boot take longer, and 48
boots have a higher probability of hitting more of them? But without
analyzing the boot sequence, your guess is as good as mine.
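
In case it helps, here is roughly how I check these sign-test tail
probabilities, as a minimal Python sketch using only the standard
library. The 35-of-48 and 14-of-48 counts are the ones from our
exchange; the helper function is merely my own name for the tail sum:

    from math import comb

    def binom_tail_ge(n, k, p):
        # P(X >= k) for X ~ Binomial(n, p).
        return sum(comb(n, i) * p**i * (1.0 - p)**(n - i)
                   for i in range(k, n + 1))

    n = 48  # boots per kernel

    # Under the null hypothesis, each of b's points falls below a's
    # median with probability 0.5, so the sign test is Binomial(48, 0.5).
    print("P(X >= 35) = %.5f" % binom_tail_ge(n, 35, 0.5))

    # 14 of new b's points above 8.6 means 34 at or below, and by
    # symmetry P(X >= 34) equals P(X <= 14).
    print("P(X >= 34) = %.5f" % binom_tail_ge(n, 34, 0.5))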

> > Assuming identical test setups, let's compare the old a median of 8.7
> > against new b. There are 14 elements in new b greater than 8.6, for a
> > probability of about 0.3%, or about 99.7% significance. This is still OK.
> >
> > In contrast, the median of the old b is 8.2, which gives extreme confidence.
> > So let's be conservative and use the large-set median.
> >
> > In real life, additional procedures would be needed to estimate the
> > confidence in the median, which turns out to be nontrivial. When I apply
>
> Luckily, in this case I could simply pick out the medians by sorting the data. ;-)
>
> > this sort of technique, I usually have all data from each sample being on one
> > side of the median of the other, which simplifies things. ;-)
>
> I like it when all data points are on one side of the other sample's median ;-)
>
> But this also relies on either much less noise during testing or a big enough
> difference between a and b, right?

Yes, life is indeed *much* easier when there is less noise or larger
differences. ;-)

> > The easiest way to estimate bounds on the median is to "bootstrap", but that
> > works best if you have 1000 samples and can randomly draw 1000 sub-
> > samples each of size 10 from the larger sample and compute the median of
> > each. You can sort these medians and obtain a cumulative distribution.
>
> Good to know "bootstap".
>
> > But you have to have an extremely good reason to collect data from 1000
> > boots, and I don't believe we have that good of a reason.
> >
>
> 1000 boots, Oh my ...
> No. No. I don't have a good reason for that ;-)
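
Just to make the suggestion concrete, the bootstrap I had in mind
looks something like the sketch below. The data, the subsample size,
and the counts are all placeholders:

    import random
    import statistics

    random.seed(42)  # reproducible illustration

    # Hypothetical boot times; substitute the real measurements here.
    sample = [8.0 + random.random() for _ in range(1000)]

    # Draw 1000 subsamples of size 10 with replacement and record the
    # median of each.
    medians = sorted(statistics.median(random.choices(sample, k=10))
                     for _ in range(1000))

    # The sorted medians approximate the cumulative distribution of
    # the median, so the 25th and 975th entries bracket a rough 95%
    # interval.
    print("95%% interval for the median: [%.3f, %.3f]"
          % (medians[24], medians[974]))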
>
> > > Analysis 2:
> > > a's median - 0.4 = 8.9 - 0.4 = 8.5.
> > > 24 of b's 48 data points are more than 0.4 below a's median.
> > > The probability that one of a's data points is less than 8.5 is
> > > p = 7/48 = 0.1458.
> >
> > This is only 85.4% significant, so...
> >
> > > For a binomial distribution with p=0.1458, P(X >= 24) = 0.0%.
> > > So, looks like we have confidence that b is 400ms faster than a.
> >
> > ...we really cannot say anything about 400ms faster. Again, you need 95%
> > and preferably 99% to really make any sort of claim. You probably need
> > quite a few more samples to say much about 200ms, let alone 400ms.
>
> OK. Thanks for correcting me.
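
For completeness, the two-step calculation above can be written as the
same sort of Python sketch. The numbers are from your Analysis 2, and
the catch, as noted, is that the estimated p is itself quite noisy:

    from math import comb

    def binom_tail_ge(n, k, p):
        # P(X >= k) for X ~ Binomial(n, p).
        return sum(comb(n, i) * p**i * (1.0 - p)**(n - i)
                   for i in range(k, n + 1))

    # Step 1: estimate, from a's own 48 points, how often a lands
    # below 8.5 (a's median minus 0.4).
    p_hat = 7 / 48  # about 0.1458

    # Step 2: tail probability of seeing 24 or more of b's 48 points
    # below 8.5 if b behaved like a.
    print("P(X >= 24) = %.2e" % binom_tail_ge(48, 24, p_hat))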
>
> > Plus, you really should select the speedup and only then take the
> > measurements. Otherwise, you end up fitting noise.
> >
> > However, assuming identical test setups, you really can calculate the median
> > from the full data set.
> >
> > > The calculated cumulative binomial distributions P(X) is via:
> > >
> > > https://www.gigacalculator.com/calculators/binomial-probability-calculator.php
> >
> > The Maxima program's binomial() function agrees with it, so good. ;-)
> >
> > > I apologize if this analysis/discussion bored some of you. ;-)
> >
> > Let's just say that it is a lot simpler when you are measuring larger
> > differences in data with tighter distributions. Me, I usually just say "no" to
> > drawing any sort of conclusion from data sets that overlap this much.
> > Instead, I might check to see if some random events are adding noise to
> > the boot duration, eliminate that, and hopefully get data that is easier to
> > analyze.
>
> Agree.
>
> > But I am good with the 99.7% confidence in a 100ms improvement.
> >
> > So if Joel wishes to make this point, he should feel free to take both of your
> > datasets and use the computation with the worse mean.
>
> Thank you so much Paul for your patience and detailed explanations.

And thank you for bearing with me.

Thanx, Paul