Re: [RESEND PATCH 2/2] sched/fair: Optimize __update_sched_avg()

From: Peter Zijlstra
Date: Thu Mar 30 2017 - 08:17:21 EST


On Thu, Mar 30, 2017 at 04:21:08AM -0700, Paul Turner wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2767,7 +2767,7 @@ static const u32 __accumulated_sum_N32[]
> > * Approximate:
> > * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
> > */
> > -static __always_inline u64 decay_load(u64 val, u64 n)
> > +static u64 decay_load(u64 val, u64 n)
> > {
> > unsigned int local_n;
> >
> > @@ -2795,32 +2795,113 @@ static __always_inline u64 decay_load(u6
> > return val;
> > }
> >
> > -/*
> > - * For updates fully spanning n periods, the contribution to runnable
> > - * average will be: \Sum 1024*y^n
> > - *
> > - * We can compute this reasonably efficiently by combining:
> > - * y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for n <PERIOD}
> > - */
> > -static u32 __compute_runnable_contrib(u64 n)
> > +static u32 __accumulate_sum(u64 periods, u32 period_contrib, u32 remainder)
>
> - The naming here is really ambiguous:
> "__accumulate_sum" -> "__accumulate_pelt_segments"?

OK, I did struggle with that a bit too but failed to improve, I'll change it.

> - Passing in "remainder" seems irrelevant to the sum accumulation. It would be
> more clear to handle it from the caller.

Well, this way we have all 3 delta parts in one function. I'll try it
and see what it looks like though.

> >
> > {
> > - u32 contrib = 0;
> > + u32 c1, c2, c3 = remainder; /* y^0 == 1 */
> >
> > - if (likely(n <= LOAD_AVG_PERIOD))
> > - return runnable_avg_yN_sum[n];
> > - else if (unlikely(n >= LOAD_AVG_MAX_N))
> > + if (!periods)
> > + return remainder - period_contrib;
>
> This is super confusing. It only works because remainder already had
> period_contrib aggregated _into_ it. We're literally computing:
> remainder + period_contrib - period_contrib

Correct; although I didn't find it too confusing. Could be because I'd
been staring at this code for a few hours though.

> We should just not call this in the !periods case and handle the remainder
> below.

I'll change it see what it looks like.

> > +
> > + if (unlikely(periods >= LOAD_AVG_MAX_N))
> > return LOAD_AVG_MAX;
> >
> > - return contrib + runnable_avg_yN_sum[n];
> > + /*
> > + * c1 = d1 y^(p+1)
> > + */
> > + c1 = decay_load((u64)(1024 - period_contrib), periods);
> > +
> > + periods -= 1;
> > + /*
> > + * For updates fully spanning n periods, the contribution to runnable
> > + * average will be:
> > + *
> > + * c2 = 1024 \Sum y^n
> > + *
> > + * We can compute this reasonably efficiently by combining:
> > + *
> > + * y^PERIOD = 1/2 with precomputed 1024 \Sum y^n {for: n < PERIOD}
> > + */
> > + if (likely(periods <= LOAD_AVG_PERIOD)) {
> > + c2 = runnable_avg_yN_sum[periods];
> > + } else {
> > + c2 = __accumulated_sum_N32[periods/LOAD_AVG_PERIOD];
>
> This still wants the comment justifying why we can't overflow.

You mean why periods/LOAD_AVG_PERIOD < ARRAY_SIZE(__accumulated_sum_N32)
? Or something else?

> > + periods %= LOAD_AVG_PERIOD;
> > + c2 = decay_load(c2, periods);
> > + c2 += runnable_avg_yN_sum[periods];
> > + }
> > +
> > + return c1 + c2 + c3;
> > }


> > + /*
> > + * Step 2
> > + */
> > + delta %= 1024;
> > + contrib = __accumulate_sum(periods, sa->period_contrib, delta);
> > + sa->period_contrib = delta;
> > +
> > + contrib = cap_scale(contrib, scale_freq);
> > + if (weight) {
> > + sa->load_sum += weight * contrib;
>
> Is this correct in the iterated periods > LOAD_AVG_MAX_N case?
> I don't think the decay above is guaranteed to return these to zero.

Ah!

Indeed, so decay_load() needs LOAD_AVG_PERIOD * 63 before it truncates
to 0, because every LOAD_AVG_PERIOD we half the value; loose 1 bit; so
63 of those and we're 0.

But __accumulate_sum() OTOH returns LOAD_AVG_MAX after only
LOAD_AVG_MAX_N, which < LOAD_AVG_PERIOD * 63.

So yes, combined we exceed LOAD_AVG_MAX, which is bad. Let me think what
to do about that.

> > + if (cfs_rq)
> > + cfs_rq->runnable_load_sum += weight * contrib;
> > + }
> > + if (running)
> > + sa->util_sum += contrib * scale_cpu;
> > +
> > + return periods;
> > +}