Re: [PATCH v3 05/12] sched/fair: Optimize __update_sched_avg()
From: Morten Rasmussen
Date: Tue May 10 2016 - 04:45:50 EST
On Wed, May 04, 2016 at 04:02:46AM +0800, Yuyang Du wrote:
> __update_sched_avg() has these steps:
>
> 1. add the remainder of the last incomplete period
> 2. decay old sum
> 3. accumulate new sum in full periods since last_update_time
> 4. add the current incomplete period
> 5. update averages
>
> Previously, we separately computed steps 1, 3, and 4, which made each
> of them ugly in code and costly in overhead.
>
> For example:
>
> c1 c3 c4
> ^ ^ ^
> | | |
> |<->|<----------------->|<--->|
> ... |---x---|------| ... |------|-----x (now)
>
> c1, c3, and c4 are the accumulated (and meanwhile decayed) time
> contributions of steps 1, 3, and 4 respectively.
>
> Then the accumulated contribution to load_sum, for example, is:
>
> contrib = c1 * weight * freq_scaled;
> contrib += c3 * weight * freq_scaled;
> contrib += c4 * weight * freq_scaled;
>
> Obviously, we can optimize the above as:
>
> contrib = c1 + c3 + c4;
> contrib *= weight * freq_scaled;
This isn't obvious to me. After spending quite a bit of time figuring
out what your code actually does, there is more to it than what you
describe
here. As your label names suggest, you don't consider what happens in
step 2 where contrib is decayed. When and how the individual bits are
decayed is essential information.
Your patch partially moves step 2 (on your list above) before step 1,
so it becomes (see the sketch after the list):
a. decay old sum
b. compute the contribution up to the first 1ms boundary (c1)
c. decay c1 to get c1'
d. accumulate the full periods (c3) adding them to c1'
e. add remaining time up until now (c4) to contrib (c3+c1').
f. scale by weight and arch_scale_{freq,cpu}_capacity() functions and
add to sum.
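In code terms, my reading of the common case (at least one full period
crossed, periods - 1 <= 32, load_sum only, delta in [us], using the
patch's own helpers; 'c1' is just a local name I made up):

	periods = (delta + sa->period_contrib) >> 10;	  /* full 1024us periods crossed */
	sa->load_sum = __decay_sum(sa->load_sum, periods);  /* a: decay old sum */
	c1 = 1024 - sa->period_contrib;			  /* b: up to first 1ms boundary */
	contrib = __decay_sum(c1, periods);		  /* c: second decay: c1 -> c1' */
	contrib += __accumulated_sum_N[periods - 1];	  /* d: full periods (c3) */
	contrib += (delta + sa->period_contrib) % 1024;	  /* e: current period (c4) */
	sa->load_sum += weight * cap_scale(contrib, scale_freq); /* f: scale once */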
The basic difference in the computation is that you consolidate the
scaling into one operation instead of three at the cost of decaying
twice instead of once. The net result is saving a few multiplications.
I would recommend that this is made very clear. Deriving the math from
the code is a daunting task for most people.
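For reference, this is what I arrive at for load_sum with the patch
applied (my own derivation, so please double-check; p >= 1 is the number
of full 1024us periods crossed, d1 = 1024 - period_contrib is c1, d3 is
c4, and y^32 = 1/2):

	contrib   = y^p * d1 + 1024 * \Sum y^n {1 <= n <= p-1} + d3
	load_sum' = y^p * load_sum + weight * cap_scale(contrib, scale_freq)

i.e. two __decay_sum() calls (one for the old sum, one for d1), but only
one weight/capacity scaling of the combined contribution.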
An alternative optimization strategy would be to leave c1 as it is and
thereby avoid decaying twice, and only add up c3 and c4 before scaling,
reducing the number of scaling operations from three to two.
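Very roughly, keeping the structure of the existing __update_load_avg()
(again just a sketch, load_sum only, delta in [us], names as in the
current code, __accumulate_sum() being the current one-argument
version):

	delta_w = 1024 - sa->period_contrib;			/* c1 */
	sa->load_sum += weight * cap_scale(delta_w, scale_freq); /* scaling op 1 */
	delta -= delta_w;
	periods = delta / 1024;
	sa->load_sum = __decay_sum(sa->load_sum, periods + 1);	/* c1 decayed with old sum */
	contrib = __accumulate_sum(periods) + delta % 1024;	/* c3 + c4 summed first */
	sa->load_sum += weight * cap_scale(contrib, scale_freq); /* scaling op 2 */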
>
> After we combine them together, we will have much cleaner code and
> fewer CPU cycles.
Could you share any numbers backing up that claim?
I did a couple of runs on arm64 (Juno):
'perf bench sched messaging -g 50 -l 2500' (10 runs):
tip: 65.545996712 seconds time elapsed ( +- 0.22% )
patch: 65.209931026 seconds time elapsed ( +- 0.23% )
Taking the error margins into account, the difference there is still an
improvement, but it is about as small as it can be without getting lost
in the noise. Is the picture any better on x86?
Whether the code is cleaner is a subjective opinion. The diffstat below
doesn't really show any benefit, but I think you have slightly more
comments so I would not judge based on that.
When it comes to code structure, the new __update_sched_avg() is a lot
simpler than __update_load_avg(), but that is only because most of the
content has been moved to accumulate_sum() and __accumulate_sum(). Where
we before had all the code in a single function that fitted on screen in
one go, you now have to consider three functions and how they work
together to figure out what is really going on.
I haven't found any bugs in the code, but IMHO I don't really see the
point in rewriting the code completely. The result isn't significantly
simpler than what we have and generates code churn affecting everyone
working with this code. I think we can improve the existing code more by
just factoring out the capacity/weight scaling, which would be much less
intrusive.
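To illustrate, something along these lines (just a sketch of the idea,
the helper name is made up):

	static __always_inline void
	accumulate_scaled(struct sched_avg *sa, struct cfs_rq *cfs_rq, u32 contrib,
			  unsigned long weight, int running,
			  unsigned long scale_freq, unsigned long scale_cpu)
	{
		contrib = cap_scale(contrib, scale_freq);
		if (weight) {
			sa->load_sum += weight * contrib;
			if (cfs_rq)
				cfs_rq->runnable_load_sum += weight * contrib;
		}
		if (running)
			sa->util_sum += contrib * scale_cpu;
	}

which could then be called for c1, c3, and c4 in the existing
__update_load_avg() without restructuring it.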
Maybe I'm missing something, but I don't see a strong argument for this
refactoring.
> Signed-off-by: Yuyang Du <yuyang.du@xxxxxxxxx>
> ---
> kernel/sched/fair.c | 189 ++++++++++++++++++++++++++-------------------------
> 1 file changed, 95 insertions(+), 94 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a060ef2..495e5f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -668,7 +668,7 @@ static unsigned long task_h_load(struct task_struct *p);
> */
> #define SCHED_AVG_HALFLIFE 32 /* number of periods as a half-life */
> #define SCHED_AVG_MAX 47742 /* maximum possible sched avg */
> -#define SCHED_AVG_MAX_N 345 /* number of full periods to produce SCHED_AVG_MAX */
> +#define SCHED_AVG_MAX_N 347 /* number of full periods to produce SCHED_AVG_MAX */
Why +2? It doesn't seem related to anything in this patch, nor is it
explained anywhere.
>
> /* Give new sched_entity start runnable values to heavy its load in infant time */
> void init_entity_runnable_average(struct sched_entity *se)
> @@ -2606,7 +2606,7 @@ static const u32 __accumulated_sum_N[] = {
>
> /*
> * Precomputed \Sum y^k { 1<=k<=n, where n%32=0). Values are rolled down to
> - * lower integers.
> + * lower integers. Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11
> */
> static const u32 __accumulated_sum_N32[] = {
> 0, 23371, 35056, 40899, 43820, 45281,
> @@ -2616,8 +2616,11 @@ static const u32 __accumulated_sum_N32[] = {
> /*
> * val * y^n, where y^m ~= 0.5
> *
> - * n is the number of periods past; a period is ~1ms
> + * n is the number of periods past. A period is ~1ms, so a 32bit
> + * integer can hold approximately a maximum of 49 (=2^32/1000/3600/24) days.
> + *
> * m is called half-life in exponential decay; here it is SCHED_AVG_HALFLIFE=32.
> + *
> */
> static __always_inline u64 __decay_sum(u64 val, u32 n)
> {
The above two hunks seem to belong to some of the previous patches in
this patch set?
> @@ -2649,20 +2652,30 @@ static __always_inline u64 __decay_sum(u64 val, u32 n)
> * We can compute this efficiently by combining:
> * y^32 = 1/2 with precomputed \Sum 1024*y^n (where n < 32)
> */
> -static __always_inline u32 __accumulate_sum(u32 n)
> +static __always_inline u32
> +__accumulate_sum(u32 periods, u32 period_contrib, u32 remainder)
> {
> - u32 contrib = 0;
> + u32 contrib;
> +
> + if (!periods)
> + return remainder - period_contrib;
>
> - if (likely(n <= SCHED_AVG_HALFLIFE))
> - return __accumulated_sum_N[n];
> - else if (unlikely(n >= SCHED_AVG_MAX_N))
> + if (unlikely(periods >= SCHED_AVG_MAX_N))
> return SCHED_AVG_MAX;
>
> - /* Since n < SCHED_AVG_MAX_N, n/SCHED_AVG_HALFLIFE < 11 */
> - contrib = __accumulated_sum_N32[n/SCHED_AVG_HALFLIFE];
> - n %= SCHED_AVG_HALFLIFE;
> - contrib = __decay_sum(contrib, n);
> - return contrib + __accumulated_sum_N[n];
> + remainder += __decay_sum((u64)(1024 - period_contrib), periods);
> +
> + periods -= 1;
> + if (likely(periods <= SCHED_AVG_HALFLIFE))
> + contrib = __accumulated_sum_N[periods];
> + else {
> + contrib = __accumulated_sum_N32[periods/SCHED_AVG_HALFLIFE];
> + periods %= SCHED_AVG_HALFLIFE;
> + contrib = __decay_sum(contrib, periods);
> + contrib += __accumulated_sum_N[periods];
> + }
> +
> + return contrib + remainder;
> }
>
> #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> @@ -2671,6 +2684,55 @@ static __always_inline u32 __accumulate_sum(u32 n)
>
> #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>
> +static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa,
> + struct cfs_rq *cfs_rq, int cpu, unsigned long weight, int running)
The unit and meaning of 'delta' confused me a lot when reviewing this
patch. Here it is the time delta since last update in [us] (duration of
c1+c3+c4).
> +{
> + u32 contrib, periods;
> + unsigned long scale_freq, scale_cpu;
> +
> + scale_freq = arch_scale_freq_capacity(NULL, cpu);
> + scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +
> + delta += sa->period_contrib;
Here it is extended to also cover the time since the previous 1ms
boundary (sa->period_contrib).
> + periods = delta >> 10; /* A period is 1024us (~1ms) */
> +
> + /*
> + * Accumulating *_sum has two steps.
> + *
> + * Step 1: decay old *_sum if we crossed period boundaries.
> + */
> + if (periods) {
> + sa->load_sum = __decay_sum(sa->load_sum, periods);
> + if (cfs_rq) {
> + cfs_rq->runnable_load_sum =
> + __decay_sum(cfs_rq->runnable_load_sum, periods);
> + }
> + sa->util_sum = __decay_sum((u64)(sa->util_sum), periods);
> + }
> +
> + /*
> + * Step 2: accumulate new *_sum since last_update_time. This at most has
> + * three parts (at least one part): (1) remainder of incomplete last
> + * period, (2) full periods since (1), and (3) incomplete current period.
> + *
> + * Fortunately, we can (and should) do all these three at once.
> + */
> + delta %= 1024;
Now 'delta' is any leftover contribution to the current 1ms period
(duration of c4).
> + contrib = __accumulate_sum(periods, sa->period_contrib, delta);
> + sa->period_contrib = delta;
> +
> + contrib = cap_scale(contrib, scale_freq);
> + if (weight) {
> + sa->load_sum += weight * contrib;
> + if (cfs_rq)
> + cfs_rq->runnable_load_sum += weight * contrib;
> + }
> + if (running)
> + sa->util_sum += contrib * scale_cpu;
> +
> + return periods;
> +}
> +
> /*
> * We can represent the historical contribution to sched average as the
> * coefficients of a geometric series. To do this we divide the history
> @@ -2701,12 +2763,9 @@ static __always_inline u32 __accumulate_sum(u32 n)
> */
> static __always_inline int
> __update_sched_avg(u64 now, int cpu, struct sched_avg *sa,
> - unsigned long weight, int running, struct cfs_rq *cfs_rq)
> + unsigned long weight, int running, struct cfs_rq *cfs_rq)
> {
> - u64 delta, scaled_delta;
> - u32 contrib, periods;
> - unsigned int delta_w, scaled_delta_w, decayed = 0;
> - unsigned long scale_freq, scale_cpu;
> + u64 delta;
>
> delta = now - sa->last_update_time;
'delta' is the true time delta, but it is in [ns] here.
I would suggest clearly specifying what each parameter is and its unit
for each of the helper functions to help people a bit.
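For example, something like this for accumulate_sum() (just a sketch):

	/*
	 * accumulate_sum() - accumulate and decay the *_sums since last_update_time
	 * @delta:	time since sa->last_update_time, in [us] (c1 + c3 + c4 above)
	 * @sa:		sched_avg to update
	 * @cfs_rq:	cfs_rq to also update runnable_load_sum for, or NULL
	 * @cpu:	cpu used for arch_scale_{freq,cpu}_capacity()
	 * @weight:	load weight of the entity/cfs_rq
	 * @running:	non-zero if the entity was running; accumulates util_sum
	 *
	 * Returns the number of full 1024us (~1ms) periods crossed, i.e.
	 * non-zero if the averages need to be updated as well.
	 */

and similar for __accumulate_sum() and __update_sched_avg() (where delta
is in [ns]).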
Morten