RE: [PATCH v3] rcu: Add a minimum time for marking boot as completed
From: Zhuo, Qiuxu
Date: Thu Mar 09 2023 - 10:31:15 EST
> From: Paul E. McKenney <paulmck@xxxxxxxxxx>
> [...]
> >
> > a's standard deviation is ~0.4.
> > b's standard deviation is ~0.5.
> >
> > a's average 9.0 is at the upper bound of b's one-standard-deviation interval [8.0, 9.0].
> > So, the measurements should be statistically significant to some degree.
>
> That single standard deviation means that you have 68% confidence that the
> difference is real. This is not far above the 50% level of random noise.
> 95% is the lowest level that is normally considered to be statistically
> significant.
95% roughly means there is no overlap between two standard deviations of a
and two standard deviations of b (i.e., the ±2σ intervals don't overlap).
This relies on either much less noise during testing or a big enough
difference between a and b.
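(For the record, the confidence that goes with ±k standard deviations of a normal distribution can be computed with the standard error function; a quick Python sketch, not tied to the measurements themselves:)

```python
import math

def coverage(k):
    """Probability that a normally distributed sample falls within
    +/- k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

print(f"1 sigma: {coverage(1):.1%}")  # ~68.3%
print(f"2 sigma: {coverage(2):.1%}")  # ~95.4%
```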
> > The calculated standard deviations are via:
> > https://www.gigacalculator.com/calculators/standard-deviation-calculator.php
>
> Fair enough. Formulas are readily available as well, and most spreadsheets
> support standard deviation.
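Indeed, Python's statistics module handles it too (a trivial sketch with made-up numbers, not the boot-time data):

```python
import statistics

# Hypothetical sample, just to show the API.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.pstdev(sample))  # population standard deviation: 2.0
print(statistics.stdev(sample))   # sample standard deviation: ~2.14
```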
>
> [...]
>
> > > Why don't you try applying this approach to the new data? You will
> > > need the general binomial formula.
> >
> > Thank you Paul for the suggestion.
> > I just tried it, but not sure whether my analysis was correct ...
> >
> > Analysis 1:
> > a's median is 8.9.
>
> I get 8.95, which is the average of the 24th and 25th members of a in
> numerical order.
Yes, it should be 8.95. Thanks for correcting me.
> > 35 of the 48 b data points are more than 0.1 below a's median.
> > Under a binomial distribution with n = 48 and p = 0.5, P(X >= 35) ≈ 0.1%.
> > So, we have strong confidence that b is 100ms faster than a.
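That 0.1% tail can be cross-checked directly in Python with `math.comb` (a quick sketch; `p_ge` is just my helper name for the upper-tail sum):

```python
from math import comb

def p_ge(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 35 of 48 sign-test "successes" under the null hypothesis p = 0.5.
print(f"{p_ge(48, 35, 0.5):.4%}")  # ~0.10%
```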
>
> I of course get quite a bit stronger confidence, but your 99.9% is good
> enough. And I get even stronger confidence going in the other direction.
> However, the fact that a's median varies from 8.7 in the old experiment to
> 8.95 in this experiment does give some pause. These are after all supposedly
> drawn from the same distribution. Or did you use a different machine or
> different OS version or some such in the two sets of measurements?
> Different time of day and thus different ambient temperature, thus different
> CPU clock frequency?
All the testing setups were identical except for the testing time.
Old a median : 8.7
New a median : 8.95
Old b median : 8.2
New b median : 8.45
I'm a bit surprised that both new medians are exactly 0.25 greater than
the old medians. Coincidence?
> Assuming identical test setups, let's apply the old a median of 8.7 to new
> b. There are 14 elements in new b greater than 8.6, for a probability of
> 0.17%, or about 98.3% significance. This is still OK.
>
> In contrast, the median of the old b is 8.2, which gives extreme confidence.
> So let's be conservative and use the large-set median.
>
> In real life, additional procedures would be needed to estimate the
> confidence in the median, which turns out to be nontrivial. When I apply
Luckily, I could simply pick the medians in numerical order in this case. ;-)
> this sort of technique, I usually have all data from each sample being on one
> side of the median of the other, which simplifies things. ;-)
I like the case where all data points fall on one side of the other's median ;-)
But that also relies on either much less noise during testing or a big enough
difference between a and b, right?
> The easiest way to estimate bounds on the median is to "bootstrap", but that
> works best if you have 1000 samples and can randomly draw 1000 sub-
> samples each of size 10 from the larger sample and compute the median of
> each. You can sort these medians and obtain a cumulative distribution.
Good to know "bootstrap".
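If I ever do collect that many samples, the procedure might sketch out like this in Python (made-up timing data and an arbitrary 90% interval, purely illustrative):

```python
import random
import statistics

random.seed(42)

# Hypothetical boot times (seconds) standing in for 1000 real measurements.
data = [8.9 + random.gauss(0, 0.4) for _ in range(1000)]

# Draw 1000 random subsamples of size 10 and record the median of each.
medians = sorted(
    statistics.median(random.choices(data, k=10)) for _ in range(1000)
)

# The middle 90% of the sorted medians gives a rough confidence interval.
lo, hi = medians[50], medians[949]
print(f"median ~ {statistics.median(data):.2f}, 90% CI [{lo:.2f}, {hi:.2f}]")
```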
> But you have to have an extremely good reason to collect data from 1000
> boots, and I don't believe we have that good of a reason.
>
1000 boots? Oh my ...
No, no. I don't have a good reason for that ;-)
> > Analysis 2:
> > a's median - 0.4 = 8.9 - 0.4 = 8.5.
> > 24 of the 48 b data points are more than 0.4 below a's median.
> > The probability that an a data point is less than 8.5 is p = 7/48
> > = 0.1458
> This is only 85.4% significant, so...
>
> > Under a binomial distribution with n = 48 and p = 0.1458, P(X >= 24) ≈ 0%.
> > So, it looks like we have confidence that b is 400ms faster than a.
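This cumulative probability can also be computed directly (a Python sketch; `p_ge` is just my helper name for the upper-tail sum):

```python
from math import comb

def p_ge(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 24 of 48 successes under the estimated p = 7/48.
print(p_ge(48, 24, 7 / 48))  # vanishingly small, effectively 0
```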
>
> ...we really cannot say anything about 400ms faster. Again, you need 95%
> and preferably 99% to really make any sort of claim. You probably need
> quite a few more samples to say much about 200ms, let alone 400ms.
OK. Thanks for correcting me.
>
> Plus, you really should select the speedup and only then take the
> measurements. Otherwise, you end up fitting noise.
>
> However, assuming identical test setups, you really can calculate the median
> from the full data set.
>
> > The calculated cumulative binomial distributions P(X) is via:
> >
> > https://www.gigacalculator.com/calculators/binomial-probability-calculator.php
>
> The maxima program's binomial() function agrees with it, so good. ;-)
>
> > I apologize if this analysis/discussion bored some of you. ;-)
>
> Let's just say that it is a lot simpler when you are measuring larger
> differences in data with tighter distributions. Me, I usually just say "no" to
> drawing any sort of conclusion from data sets that overlap this much.
> Instead, I might check to see if there are some random events adding noise to
> the boot duration, eliminate that, and hopefully get data that is easier to
> analyze.
Agree.
> But I am good with the 98.3% confidence in a 100ms improvement.
>
> So if Joel wishes to make this point, he should feel free to take both of your
> datasets and use the computation with the worse mean.
Thank you so much Paul for your patience and detailed comments.
-Qiuxu