Re: [PATCH v2] sched/fair: prefer prev cpu in asymmetric wakeup path
From: Vincent Guittot
Date: Thu Oct 29 2020 - 09:49:34 EST
On Thu, 29 Oct 2020 at 13:17, Tao Zhou <ouwen210@xxxxxxxxxxx> wrote:
>
> Hi Vincent,
>
> On Wed, Oct 28, 2020 at 06:44:12PM +0100, Vincent Guittot wrote:
> > During the fast wakeup path, the scheduler always checks whether the local
> > or prev cpus are good candidates for the task before looking for other cpus
> > in the domain. With
> > commit b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")
> > the heterogeneous system gains a dedicated path but doesn't try to reuse
> > the prev cpu whenever possible. If the previous cpu is idle and belongs to
> > the LLC domain, we should check it first before looking for another cpu
> > because it remains one of the best candidates and this also stabilizes task
> > placement on the system.
> >
> > This change aligns the asymmetric path behavior with the symmetric one and
> > reduces the cases where the task migrates across all cpus of the
> > sd_asym_cpucapacity domains at wakeup.
> >
> > This change does not impact normal EAS mode but only the overloaded case or
> > when EAS is not used.
> >
> > - On hikey960 with performance governor (EAS disabled)
> >
> > ./perf bench sched pipe -T -l 50000
> > mainline w/ patch
> > # migrations 999364 0
> > ops/sec 149313(+/-0.28%) 182587(+/- 0.40) +22%
> >
> > - On hikey with performance governor
> >
> > ./perf bench sched pipe -T -l 50000
> > mainline w/ patch
> > # migrations 0 0
> > ops/sec 47721(+/-0.76%) 47899(+/- 0.56) +0.4%
> >
> > According to the tests on hikey, the patch doesn't impact symmetric systems
> > compared to the current implementation (only tested on arm64).
> >
> > Also read the uclamped value of the task's utilization at most twice instead
> > of each time we compare the task's utilization with the cpu's capacity.
> >
> > Fixes: b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")
> > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > ---
> > Changes in v2:
> > - merge the asymmetric and symmetric paths instead of duplicating tests on target,
> > prev and other special cases.
> >
> > - factorize the call to uclamp_task_util(p) and use fits_capacity(). This could
> > explain part of the additional improvement compared to v1 (+22% instead of
> > +17% on v1).
> >
> > - Keep using the LLC instead of the asym domain for the early checks of target,
> > prev and recent_used_cpu to ensure cache sharing between the tasks. This doesn't
> > change anything for dynamiQ but will ensure the same cache for legacy big.LITTLE
> > and also simplify the changes.
> >
> > - don't check the capacity for the per-cpu kthread use case because the assumption
> > is that the wakee queued work for the per-cpu kthread that is now complete and
> > the task was already on this cpu.
> >
> > - On an asymmetric system where an exclusive cpuset defines a symmetric island,
> > the task's load is synced and tested although it's not needed. But taking care of
> > this special case by testing whether sd_asym_cpucapacity is not null impacts the
> > performance of the default sched_asym_cpucapacity path by more than 4%.
> >
> > - The huge increase in the number of migrations for hikey960 mainline comes from
> > the fact that the ftrace buffer was overloaded by events in the tests done
> > with v1.
> >
> > kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++-----------------
> > 1 file changed, 43 insertions(+), 25 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index aa4c6227cd6d..131b917b70f8 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6173,20 +6173,20 @@ static int
> > select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > {
> > unsigned long best_cap = 0;
> > - int cpu, best_cpu = -1;
> > + int task_util, cpu, best_cpu = -1;
> > struct cpumask *cpus;
> >
> > - sync_entity_load_avg(&p->se);
> > -
> > cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >
> > + task_util = uclamp_task_util(p);
>
> The return type of uclamp_task_util() is *unsigned long*.
> But, task_util is *int*.
argh, yes, I have been fooled by the return type of task_fits_capacity().
From a functional PoV, we were safe because the task's util_avg can't go above 1024.
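For reference, the helper looks roughly like this (simplified and quoted from
memory, so check fair.c for the exact definition):

	/* Result is clamped into [0, SCHED_CAPACITY_SCALE], i.e. at most 1024 */
	static inline unsigned long uclamp_task_util(struct task_struct *p)
	{
		return clamp(task_util_est(p),
			     uclamp_eff_value(p, UCLAMP_MIN),
			     uclamp_eff_value(p, UCLAMP_MAX));
	}

The result is bounded by SCHED_CAPACITY_SCALE (1024), far below INT_MAX, so the
implicit narrowing into an int can't overflow, even if unsigned long remains the
right type to use for task_util.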
>
> @Valentin: I checked that the return type of uclamp_task_util()
> was changed from unsigned int to unsigned long by you.
>
>
> > for_each_cpu_wrap(cpu, cpus, target) {
> > unsigned long cpu_cap = capacity_of(cpu);
> >
> > if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> > continue;
> > - if (task_fits_capacity(p, cpu_cap))
> > + if (fits_capacity(task_util, cpu_cap))
> > return cpu;
> >
> > if (cpu_cap > best_cap) {
> > @@ -6198,44 +6198,41 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > return best_cpu;
> > }
> >
> > +static inline int asym_fits_capacity(int task_util, int cpu)
> > +{
> > + if (static_branch_unlikely(&sched_asym_cpucapacity))
> > + return fits_capacity(task_util, capacity_of(cpu));
> > +
> > + return 1;
> > +}
>
> The return type of asym_fits_capacity() could be *bool*.
>
> 'return fits_capacity(task_util, capacity_of(cpu));'
>
> is equivalent to:
>
> 'return ((cap) * 1280 < (max) * 1024);'
>
> I don't know this area well enough, I just noticed that it may use a few more
> stack bytes.
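Something like the below would match what you describe (an untested sketch on
top of this patch, assuming task_util is also turned into an unsigned long):

	/* Only check the cpu's capacity on asymmetric CPU capacity systems */
	static inline bool asym_fits_capacity(unsigned long task_util, int cpu)
	{
		if (static_branch_unlikely(&sched_asym_cpucapacity))
			return fits_capacity(task_util, capacity_of(cpu));

		return true;
	}

fits_capacity() already evaluates to 0/1, so nothing changes at the call sites.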
>
> > /*
> > * Try and locate an idle core/thread in the LLC cache domain.
> > */
> > static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > {
> > struct sched_domain *sd;
> > - int i, recent_used_cpu;
> > + int i, recent_used_cpu, task_util;
> >
> > /*
> > - * For asymmetric CPU capacity systems, our domain of interest is
> > - * sd_asym_cpucapacity rather than sd_llc.
> > + * On asymmetric system, update task utilization because we will check
> > + * that the task fits with cpu's capacity.
> > */
> > if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> > - sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
> > - /*
> > - * On an asymmetric CPU capacity system where an exclusive
> > - * cpuset defines a symmetric island (i.e. one unique
> > - * capacity_orig value through the cpuset), the key will be set
> > - * but the CPUs within that cpuset will not have a domain with
> > - * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
> > - * capacity path.
> > - */
> > - if (!sd)
> > - goto symmetric;
> > -
> > - i = select_idle_capacity(p, sd, target);
> > - return ((unsigned)i < nr_cpumask_bits) ? i : target;
> > + sync_entity_load_avg(&p->se);
> > + task_util = uclamp_task_util(p);
> > }
> >
> > -symmetric:
> > - if (available_idle_cpu(target) || sched_idle_cpu(target))
> > + if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
> > + asym_fits_capacity(task_util, target))
> > return target;
> >
> > /*
> > * If the previous CPU is cache affine and idle, don't be stupid:
> > */
> > if (prev != target && cpus_share_cache(prev, target) &&
> > - (available_idle_cpu(prev) || sched_idle_cpu(prev)))
> > + (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
> > + asym_fits_capacity(task_util, prev))
> > return prev;
> >
> > /*
> > @@ -6258,7 +6255,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > recent_used_cpu != target &&
> > cpus_share_cache(recent_used_cpu, target) &&
> > (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> > - cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr)) {
> > + cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
> > + asym_fits_capacity(task_util, recent_used_cpu)) {
> > /*
> > * Replace recent_used_cpu with prev as it is a potential
> > * candidate for the next wake:
> > @@ -6267,6 +6265,26 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > return recent_used_cpu;
> > }
> >
> > + /*
> > + * For asymmetric CPU capacity systems, our domain of interest is
> > + * sd_asym_cpucapacity rather than sd_llc.
> > + */
> > + if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> > + sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
> > + /*
> > + * On an asymmetric CPU capacity system where an exclusive
> > + * cpuset defines a symmetric island (i.e. one unique
> > + * capacity_orig value through the cpuset), the key will be set
> > + * but the CPUs within that cpuset will not have a domain with
> > + * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric
> > + * capacity path.
> > + */
> > + if (sd) {
> > + i = select_idle_capacity(p, sd, target);
> > + return ((unsigned)i < nr_cpumask_bits) ? i : target;
> > + }
> > + }
> > +
> > sd = rcu_dereference(per_cpu(sd_llc, target));
> > if (!sd)
> > return target;
> > --
> > 2.17.1