Re: [PATCH] sched/fair: Optimize sum computation with a lookup table

From: Peter Zijlstra
Date: Fri Apr 08 2016 - 06:44:43 EST


On Fri, Apr 08, 2016 at 10:07:20AM +0800, Yuyang Du wrote:
> __compute_runnable_contrib() uses a loop to compute sum, whereas a
> table loopup can do it faster in a constant time.

> - /* Compute \Sum k^n combining precomputed values for k^i, \Sum k^j */
> - do {
> - contrib /= 2; /* y^LOAD_AVG_PERIOD = 1/2 */
> - contrib += runnable_avg_yN_sum[LOAD_AVG_PERIOD];
> -
> - n -= LOAD_AVG_PERIOD;
> - } while (n > LOAD_AVG_PERIOD);
> -
> + /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
> + contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
> + n %= LOAD_AVG_PERIOD;
> contrib = decay_load(contrib, n);
> return contrib + runnable_avg_yN_sum[n];

You replace a simple loop with a DIV instruction and a potential extra
cachemiss.

Is that really faster? What is the median 'n' for which we run that
loop? IOW how many loops do we normally do?

And remember that while recent Intel chips are really good at divisions,
not everybody is (and even then they're still slow).