Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

From: Joel Fernandes
Date: Fri Apr 20 2018 - 04:13:18 EST


On Wed, Apr 18, 2018 at 4:17 AM, Quentin Perret <quentin.perret@xxxxxxx> wrote:
> On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:
>> Hi,
>>
>> On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
>> <dietmar.eggemann@xxxxxxx> wrote:
>> > From: Thara Gopinath <thara.gopinath@xxxxxxxxxx>
>> >
>> > Energy-aware scheduling should only operate when the system is not
>> > overutilized. There must be cpu time available to place tasks based on
>> > utilization in an energy-aware fashion, i.e. to pack tasks on
>> > energy-efficient cpus without harming the overall throughput.
>> >
>> > In case the system operates above this tipping point the tasks have to
>> > be placed based on task and cpu load in the classical way of spreading
>> > tasks across as many cpus as possible.
>> >
>> > The point in which a system switches from being not overutilized to
>> > being overutilized is called the tipping point.
>> >
>> > Such a tipping point indicator on a sched domain as the system
>> > boundary is introduced here. As soon as one cpu of a sched domain is
>> > overutilized the whole sched domain is declared overutilized as well.
>> > A cpu becomes overutilized when its utilization is higher that 80%
>> > (capacity_margin) of its capacity.
>> >
>> > The implementation takes advantage of the shared sched domain which is
>> > shared across all per-cpu views of a sched domain level. The new
>> > overutilized flag is placed in this shared sched domain.
>> >
>> > Load balancing is skipped in case the energy model is present and the
>> > sched domain is not overutilized because under this condition the
>> > predominantly load-per-capacity driven load-balancer should not
>> > interfere with the energy-aware wakeup placement based on utilization.
>> >
>> > In case the total utilization of a sched domain is greater than the
>> > total sched domain capacity the overutilized flag is set at the parent
>> > sched domain level to let other sched groups help getting rid of the
>> > overutilization of cpus.
>> >
>> > Signed-off-by: Thara Gopinath <thara.gopinath@xxxxxxxxxx>
>> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
>> > ---
>> > include/linux/sched/topology.h | 1 +
>> > kernel/sched/fair.c | 62 ++++++++++++++++++++++++++++++++++++++++--
>> > kernel/sched/sched.h | 1 +
>> > kernel/sched/topology.c | 12 +++-----
>> > 4 files changed, 65 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> > index 26347741ba50..dd001c232646 100644
>> > --- a/include/linux/sched/topology.h
>> > +++ b/include/linux/sched/topology.h
>> > @@ -72,6 +72,7 @@ struct sched_domain_shared {
>> > atomic_t ref;
>> > atomic_t nr_busy_cpus;
>> > int has_idle_cores;
>> > + int overutilized;
>> > };
>> >
>> > struct sched_domain {
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 0a76ad2ef022..6960e5ef3c14 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
>> > }
>> > #endif
>> >
>> > +#ifdef CONFIG_SMP
>> > +static inline int cpu_overutilized(int cpu);
>> > +
>> > +static inline int sd_overutilized(struct sched_domain *sd)
>> > +{
>> > + return READ_ONCE(sd->shared->overutilized);
>> > +}
>> > +
>> > +static inline void update_overutilized_status(struct rq *rq)
>> > +{
>> > + struct sched_domain *sd;
>> > +
>> > + rcu_read_lock();
>> > + sd = rcu_dereference(rq->sd);
>> > + if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
>> > + WRITE_ONCE(sd->shared->overutilized, 1);
>> > + rcu_read_unlock();
>> > +}
>> > +#else
>> > +static inline void update_overutilized_status(struct rq *rq) {}
>> > +#endif /* CONFIG_SMP */
>> > +
>> > /*
>> > * The enqueue_task method is called before nr_running is
>> > * increased. Here we update the fair scheduling stats and
>> > @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> > update_cfs_group(se);
>> > }
>> >
>> > - if (!se)
>> > + if (!se) {
>> > add_nr_running(rq, 1);
>> > + update_overutilized_status(rq);
>> > + }
>>
>> I'm wondering if it makes sense for considering scenarios whether
>> other classes cause CPUs in the domain to go above the tipping point.
>> Then in that case also, it makes sense to not to do EAS in that domain
>> because of the overutilization.
>>
>> I guess task_fits using cpu_util which is PELT only at the moment...
>> so may require some other method like aggregation of CFS PELT, with
>> RT-PELT and DL running bw or something.
>>
>
> So at the moment in cpu_overutilized() we comapre cpu_util() to
> capacity_of() which should include RT and IRQ pressure IIRC. But
> you're right, we might be able to do more here... Perhaps we
> could also use cpu_util_dl() which is available in sched.h now ?

Yes, should be Ok, and then when RT utilization stuff is available,
then that can be included in the equation as well (probably for now
you could use rt_avg).

Another crazy idea is to check the contribution of higher classes in
one-shot with (capacity_orig_of - capacity_of) although I think that
method would be less instantaneous/accurate.

thanks,

- Joel