Re: [PATCH v3 10/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
From: Morten Rasmussen
Date: Thu Aug 18 2016 - 09:47:47 EST
On Thu, Aug 18, 2016 at 07:46:44PM +0800, Wanpeng Li wrote:
> 2016-08-18 18:24 GMT+08:00 Morten Rasmussen <morten.rasmussen@xxxxxxx>:
> > On Thu, Aug 18, 2016 at 09:40:55AM +0100, Morten Rasmussen wrote:
> >> On Mon, Aug 15, 2016 at 04:42:37PM +0100, Morten Rasmussen wrote:
> >> > On Mon, Aug 15, 2016 at 04:23:42PM +0200, Peter Zijlstra wrote:
> >> > > But unlike that function, it doesn't actually use __update_load_avg().
> >> > > Why not?
> >> >
> >> > Fair question :)
> >> >
> >> > We currently exploit the fact that the task utilization is _not_ updated
> >> > in wake-up balancing to make sure we don't under-estimate the capacity
> >> > requirements for tasks that have slept for a while. If we update it, we
> >> > lose the non-decayed 'peak' utilization, but I guess we could just
> >> > store it somewhere when we do the wake-up decay.
> >> >
> >> > I thought there was a better reason when I wrote the patch, but I don't
> >> > recall right now. I will look into it again and see if we can use
> >> > __update_load_avg() to do a proper update instead of doing things twice.
> >>
> >> AFAICT, we should be able to synchronize the task utilization to the
> >> previous rq utilization using __update_load_avg() as you suggest. The
> >> patch below should work as a replacement without any changes to
> >> subsequent patches. It doesn't solve the under-estimation issue, but I
> >> have another patch for that.
> >
> > And here is a possible solution to the under-estimation issue. The patch
> > would have to go at the end of this set.
> >
> > ---8<---
> >
> > From 5bc918995c6c589b833ba1f189a8b92fa22202ae Mon Sep 17 00:00:00 2001
> > From: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> > Date: Wed, 17 Aug 2016 15:30:43 +0100
> > Subject: [PATCH] sched/fair: Track peak per-entity utilization
> >
> > Using the decayed PELT (per-entity load tracking) utilization to place
> > tasks at wake-up leads to under-estimation of the true utilization of
> > the task, since the utilization has decayed while the task slept. This
> > could mean putting the task on a cpu with less available capacity than
> > is actually needed. The issue can be mitigated by using the 'peak'
> > utilization instead of the decayed utilization for placement decisions,
> > e.g. at task wake-up.
> >
> > The 'peak' utilization metric, util_peak, tracks util_avg when the task
> > is running and retains its previous value while the task is
> > blocked/waiting on the rq. It is instantly updated to track util_avg
> > again as soon as the task is running again.
>
> Maybe this will end up disabling wake affine due to a spiked peak
> value for a task with a low average load.
I assume you are referring to using task_util_peak() instead of
task_util() in wake_cap()?

The peak value should never exceed the util_avg accumulated by the task
the last time it ran. So any spike has to be caused by the task
accumulating more utilization the last time it ran. We don't know
whether it is a spike or a more permanent change in behaviour, so we
have to guess. Hence a spike on an asymmetric system could cause us to
disable wake affine in some circumstances (either prev_cpu or the waker
cpu has to have low compute capacity) for the following wake-up.

SMP should be unaffected as we should bail out on the previous
condition.
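
To make the asymmetric case concrete, here is a minimal sketch (not
kernel code) mirroring the wake_cap() return in the patch below. The
helper name and the numbers are made up, and capacity_margin = 1280 is
assumed, as elsewhere in this series:

#include <stdio.h>

/*
 * Illustrative sketch only, mirroring the wake_cap() return in the
 * patch below. A non-zero result means the task's peak utilization
 * does not fit within ~80% of min_cap (the smaller of the prev_cpu
 * and waker cpu capacities), so wake affine gets disabled for this
 * wake-up.
 */
static int peak_exceeds_min_cap(unsigned long util_peak, unsigned long min_cap)
{
	unsigned long capacity_margin = 1280;	/* as assumed in this series */

	return min_cap * 1024 < util_peak * capacity_margin;
}

int main(void)
{
	/* Made-up numbers: little cpu capacity 430, spiked peak utilization 400. */
	printf("disable wake affine: %d\n", peak_exceeds_min_cap(400, 430));
	return 0;
}
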
The counter-example is a task with a fairly long busy period and a much
longer overall period (cycle). Its util_avg might have decayed away
since the last activation, so it appears very small at wake-up and we
end up putting it on a low capacity cpu every time, even though it
keeps the cpu busy for a long time each time it wakes up.
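
As a rough back-of-the-envelope illustration of that counter-example
(not kernel code, and the starting utilization and sleep time are
invented), PELT halves the contribution roughly every 32ms, so a
sizeable util_avg shrinks quickly across a long sleep:

#include <math.h>
#include <stdio.h>

/*
 * Rough illustration of the counter-example above, not kernel code.
 * PELT decays with a ~32ms half-life, so a task that built up a large
 * util_avg over a long busy period looks almost idle after a long
 * sleep.
 */
int main(void)
{
	double util_avg = 800.0;	/* hypothetical util after a long busy period */
	double sleep_ms = 100.0;	/* hypothetical sleep before the next wake-up */
	double at_wakeup = util_avg * pow(0.5, sleep_ms / 32.0);

	printf("util_avg at wake-up: ~%.0f of 1024\n", at_wakeup);	/* ~92 */
	return 0;
}
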
Did that answer your question?
Thanks,
Morten
>
> Regards,
> Wanpeng Li
>
> >
> > cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> > ---
> > include/linux/sched.h | 2 +-
> > kernel/sched/fair.c | 18 ++++++++++++++----
> > 2 files changed, 15 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 4e0c47af9b05..40e427d1d378 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1281,7 +1281,7 @@ struct load_weight {
> > struct sched_avg {
> > u64 last_update_time, load_sum;
> > u32 util_sum, period_contrib;
> > - unsigned long load_avg, util_avg;
> > + unsigned long load_avg, util_avg, util_peak;
> > };
> >
> > #ifdef CONFIG_SCHEDSTATS
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 11b250531ed4..8462a3d455ff 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -692,6 +692,7 @@ void init_entity_runnable_average(struct sched_entity *se)
> > * At this point, util_avg won't be used in select_task_rq_fair anyway
> > */
> > sa->util_avg = 0;
> > + sa->util_peak = 0;
> > sa->util_sum = 0;
> > /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
> > }
> > @@ -744,6 +745,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
> > } else {
> > sa->util_avg = cap;
> > }
> > + sa->util_peak = sa->util_avg;
> > sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> > }
> >
> > @@ -2806,6 +2808,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> > }
> >
> > + if (running || sa->util_avg > sa->util_peak)
> > + sa->util_peak = sa->util_avg;
> > +
> > return decayed;
> > }
> >
> > @@ -5174,7 +5179,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > return 1;
> > }
> >
> > -static inline int task_util(struct task_struct *p);
> > +static inline int task_util_peak(struct task_struct *p);
> > static int cpu_util_wake(int cpu, struct task_struct *p);
> >
> > static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> > @@ -5257,10 +5262,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> > } while (group = group->next, group != sd->groups);
> >
> > /* Found a significant amount of spare capacity. */
> > - if (this_spare > task_util(p) / 2 &&
> > + if (this_spare > task_util_peak(p) / 2 &&
> > imbalance*this_spare > 100*most_spare)
> > return NULL;
> > - else if (most_spare > task_util(p) / 2)
> > + else if (most_spare > task_util_peak(p) / 2)
> > return most_spare_sg;
> >
> > if (!idlest || 100*this_load < imbalance*min_load)
> > @@ -5423,6 +5428,11 @@ static inline int task_util(struct task_struct *p)
> > return p->se.avg.util_avg;
> > }
> >
> > +static inline int task_util_peak(struct task_struct *p)
> > +{
> > + return p->se.avg.util_peak;
> > +}
> > +
> > /*
> > * cpu_util_wake: Compute cpu utilization with any contributions from
> > * the waking task p removed.
> > @@ -5455,7 +5465,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > /* Bring task utilization in sync with prev_cpu */
> > sync_entity_load_avg(&p->se);
> >
> > - return min_cap * 1024 < task_util(p) * capacity_margin;
> > + return min_cap * 1024 < task_util_peak(p) * capacity_margin;
> > }
> >
> > /*
> > --
> > 1.9.1
> >
>
>
>
> --
> Regards,
> Wanpeng Li