Re: [RFC PATCH 1/2] sched/rt: add utilization tracking

From: Morten Rasmussen
Date: Tue May 30 2017 - 11:50:49 EST


On Wed, May 24, 2017 at 11:00:51AM +0200, Vincent Guittot wrote:
> schedutil governor relies on cfs_rq's util_avg to choose the OPP when cfs
> tasks are running. When the CPU is overloaded by cfs and rt tasks, cfs tasks
> are preempted by rt tasks and in this case util_avg reflects the remaining
> capacity that is used by cfs tasks but not what cfs tasks want to use. In such
> case, schedutil can select a lower OPP when cfs task runs whereas the CPU is
> overloaded. In order to have a more accurate view of the utilization of the
> CPU, we track the utilization that is used by RT tasks.
> DL tasks are not taken into account as they have their own utilization
> tracking mecanism.
>
> We don't use rt_avg which doesn't have the same dynamic as PELT and which
> can include IRQ time that are also accounted in cfs task utilization

If I get it correctly, the fundamental issue is that we need utilization
metrics for all sched_classes that are sufficiently similar that adding
them up is a useful input to schedutil? rt_avg which already exists
today is too slow to be fit for purpose, so lets try PELT.

It isn't a full PELT implementation in this patch. In fact, it isn't
per-entity :-) However, it might be sufficient to improve things without
causing too much overhead.

>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---
>
> If the changes are reasonnable, it might worth moving the PELT function in a
> dedicated pelt.c file and the ugly
> extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
> in a pelt.h header
>
>
> kernel/sched/fair.c | 21 +++++++++++++++++++++
> kernel/sched/rt.c | 9 +++++++++
> kernel/sched/sched.h | 3 +++
> 3 files changed, 33 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 47a0c55..0d7698b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2950,6 +2950,17 @@ __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
> cfs_rq->curr != NULL, cfs_rq);
> }
>
> +int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running)
> +{
> + int ret;
> +
> + ret = ___update_load_avg(now, cpu, &rt_rq->avg, 0, running, NULL);
> +
> +
> + return ret;
> +}
> +
> +
> /*
> * Signed add and clamp on underflow.
> *
> @@ -3478,6 +3489,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq, bool update_freq)
> return 0;
> }
>
> +int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running)
> +{
> + return 0;
> +}
> +
> #define UPDATE_TG 0x0
> #define SKIP_AGE_LOAD 0x0
>
> @@ -7008,6 +7024,10 @@ static void update_blocked_averages(int cpu)
> if (cfs_rq_is_decayed(cfs_rq))
> list_del_leaf_cfs_rq(cfs_rq);
> }
> +
> + update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0);
> +
> +
> rq_unlock_irqrestore(rq, &rf);
> }

This would be easier if we had a place to stick remote updates of
various metrics for NOHZ idle cpus?

>
> @@ -7067,6 +7087,7 @@ static inline void update_blocked_averages(int cpu)
> rq_lock_irqsave(rq, &rf);
> update_rq_clock(rq);
> update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true);
> + update_rt_rq_load_avg(rq_clock_task(rq), cpu, &rq->rt, 0);
> rq_unlock_irqrestore(rq, &rf);
> }
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 581d5c7..09293fa 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1534,6 +1534,8 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
> return p;
> }
>
> +extern int update_rt_rq_load_avg(u64 now, int cpu, struct rt_rq *rt_rq, int running);
> +
> static struct task_struct *
> pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> {
> @@ -1579,6 +1581,10 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>
> queue_push_tasks(rq);
>
> + if (p)
> + update_rt_rq_load_avg(rq_clock_task(rq), cpu_of(rq), rt_rq,
> + rq->curr->sched_class == &rt_sched_class);
> +
> return p;
> }

I'm quite likely missing something, but why is this update need here?

IIUC, put_prev_task() was called just a couple of lines further up which
also does an update (below). The only difference is that this update
depends on the current task being an RT task, which isn't always the
case. Is it supposed to decay the RT PELT average when we pick an RT
task and the previous task !RT? In that case, the "running" argument
needs a ! in front.

Morten