Re: [RESEND PATCH 2/2] sched/fair: Optimize __update_sched_avg()

From: Yuyang Du
Date: Thu Mar 30 2017 - 22:44:52 EST


Hi Paul,

On Thu, Mar 30, 2017 at 04:21:08AM -0700, Paul Turner wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2767,7 +2767,7 @@ static const u32 __accumulated_sum_N32[]
> > * Approximate:
> > * val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
> > */
> > -static __always_inline u64 decay_load(u64 val, u64 n)
> > +static u64 decay_load(u64 val, u64 n)
> > {
> > unsigned int local_n;
> >
> > @@ -2795,32 +2795,113 @@ static __always_inline u64 decay_load(u6
> > return val;
> > }
> >
> > -/*
> > - * For updates fully spanning n periods, the contribution to runnable
> > - * average will be: \Sum 1024*y^n
> > - *
> > - * We can compute this reasonably efficiently by combining:
> > - * y^PERIOD = 1/2 with precomputed \Sum 1024*y^n {for n <PERIOD}
> > - */
> > -static u32 __compute_runnable_contrib(u64 n)
> > +static u32 __accumulate_sum(u64 periods, u32 period_contrib, u32 remainder)
>
> - The naming here is really ambiguous:
> "__accumulate_sum" -> "__accumulate_pelt_segments"?

Yes, sounds better :)

> - Passing in "remainder" seems irrelevant to the sum accumulation. It would be
> more clear to handle it from the caller.

It's relevant, because it "may" be c3. But maybe "remainder" isn't a good name,
as it's actually:

delta %= 1024, in which delta is (now - last_period_boundary).

> >
> > {
> > - u32 contrib = 0;
> > + u32 c1, c2, c3 = remainder; /* y^0 == 1 */
> >
> > - if (likely(n <= LOAD_AVG_PERIOD))
> > - return runnable_avg_yN_sum[n];
> > - else if (unlikely(n >= LOAD_AVG_MAX_N))
> > + if (!periods)
> > + return remainder - period_contrib;
>
> This is super confusing. It only works because remainder already had
> period_contrib aggregated _into_ it. We're literally computing:
> remainder + period_contrib - period_contrib

Sort of, but we're just reusing (delta += period_contrib), which is the simple
way to compute periods. And we carry the hope that we passed (delta %= 1024) as
the c3 if (periods). Or we could keep the real "delta" only for the !0 periods
case, which is good too. Yes, making the code as compact as possible produces
confusion, in which direction I have gone too much.

> We should just not call this in the !periods case and handle the remainder
> below.

It'd be clear to stick to __accumulate_pelt_segments() being c1 + c2 + c3,
described as step 2.

> > +
> > + if (unlikely(periods >= LOAD_AVG_MAX_N))
> > return LOAD_AVG_MAX;
> >
> > - /* Since n < LOAD_AVG_MAX_N, n/LOAD_AVG_PERIOD < 11 */
> > - contrib = __accumulated_sum_N32[n/LOAD_AVG_PERIOD];
> > - n %= LOAD_AVG_PERIOD;
> > - contrib = decay_load(contrib, n);
> > - return contrib + runnable_avg_yN_sum[n];
> > + /*
> > + * c1 = d1 y^(p+1)
> > + */
> > + c1 = decay_load((u64)(1024 - period_contrib), periods);
> > +
> > + periods -= 1;
> > + /*
> > + * For updates fully spanning n periods, the contribution to runnable
> > + * average will be:
> > + *
> > + * c2 = 1024 \Sum y^n
> > + *
> > + * We can compute this reasonably efficiently by combining:
> > + *
> > + * y^PERIOD = 1/2 with precomputed 1024 \Sum y^n {for: n < PERIOD}
> > + */
> > + if (likely(periods <= LOAD_AVG_PERIOD)) {
> > + c2 = runnable_avg_yN_sum[periods];
> > + } else {
> > + c2 = __accumulated_sum_N32[periods/LOAD_AVG_PERIOD];
>
> This still wants the comment justifying why we can't overflow.

Yes, and it'd be:

/*
* We can't overflow because we would have returned if
* periods >= LOAD_AVG_MAX_N.
*/

> > + periods %= LOAD_AVG_PERIOD;
> > + c2 = decay_load(c2, periods);
> > + c2 += runnable_avg_yN_sum[periods];
> > + }
> > +
> > + return c1 + c2 + c3;
> > }
> >
> > #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
> >
> > /*
> > + * Accumulate the three separate parts of the sum; d1 the remainder
> > + * of the last (incomplete) period, d2 the span of full periods and d3
> > + * the remainder of the (incomplete) current period.
> > + *
> > + * d1 d2 d3
> > + * ^ ^ ^
> > + * | | |
> > + * |<->|<----------------->|<--->|
> > + * ... |---x---|------| ... |------|-----x (now)
> > + *
> > + * p
> > + * u' = (u + d1) y^(p+1) + 1024 \Sum y^n + d3 y^0
> > + * n=1
> > + *
> > + * = u y^(p+1) + (Step 1)
> > + *
> > + * p
> > + * d1 y^(p+1) + 1024 \Sum y^n + d3 y^0 (Step 2)
> > + * n=1
> > + */
> > +static __always_inline u32
> > +accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > + unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > +{
> > + unsigned long scale_freq, scale_cpu;
> > + u64 periods;
> > + u32 contrib;
> > +
> > + scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > + scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +
> > + delta += sa->period_contrib;
> > + periods = delta / 1024; /* A period is 1024us (~1ms) */
> > +
> > + if (periods) {
> > + sa->load_sum = decay_load(sa->load_sum, periods);
> > + if (cfs_rq) {
> > + cfs_rq->runnable_load_sum =
> > + decay_load(cfs_rq->runnable_load_sum, periods);
> > + }
> > + sa->util_sum = decay_load((u64)(sa->util_sum), periods);
>
> Move step 2 here, handle remainder below.

See above.

Thanks,
Yuyang