Re: [PATCH v2] sched: rt: Make RT capacity aware

From: Qais Yousef
Date: Wed Oct 23 2019 - 08:34:26 EST


Adding some Android folks who might be interested.

Steven/Peter, in case this has dropped off your queue, it'd be great to get
some feedback when you get a chance to look at it.

Thanks

--
Qais Yousef

On 10/09/19 11:46, Qais Yousef wrote:
> Capacity Awareness refers to the fact that on heterogeneous systems
> (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
> when placing tasks we need to be aware of this difference in CPU
> capacities.
>
> In such scenarios we want to ensure that the selected CPU has enough
> capacity to meet the requirement of the running task. Enough capacity
> here means that capacity_orig_of(cpu) >= task.requirement.
>
> The definition of task.requirement is dependent on the scheduling class.
>
> For CFS, utilization is used to select a CPU that has a capacity
> value >= the cfs_task.util.
>
> capacity_orig_of(cpu) >= cfs_task.util
>
> DL isn't capacity aware at the moment but can make use of its bandwidth
> reservation to implement that in a similar manner to how CFS uses
> utilization. The following patchset implements that:
>
> https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@xxxxxxxxxxxxxxx/
>
> capacity_orig_of(cpu)/SCHED_CAPACITY_SCALE >= dl_runtime/dl_deadline
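
(Side note for the folks newly Cc'ed, as a worked example of that check:
a DL task with dl_runtime=2.5ms and dl_deadline=10ms reserves 25% of a
CPU, i.e. ~256 out of SCHED_CAPACITY_SCALE (1024), so only CPUs with
capacity_orig_of() >= 256 would be considered a fit for it.)
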
>
> For RT we don't have a per-task utilization signal and we lack any
> information in general about the performance requirement of the RT
> task. But with the introduction of uclamp, RT tasks can now control
> that by setting uclamp_min to guarantee a minimum performance point.
>
> ATM the uclamp values are only used for frequency selection; but on
> heterogeneous systems this is not enough and we also need to ensure
> that the capacity of the CPU is >= uclamp_min, which is what is
> implemented here.
>
> capacity_orig_of(cpu) >= rt_task.uclamp_min
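
Another note for those who haven't used the uclamp interface yet: below
is a minimal userspace sketch (not part of this patch) of how an RT task
could request a minimum performance point of 512, so that only CPUs with
capacity_orig_of() >= 512 are considered to fit it. The struct layout and
flag value mirror the v5.3 uapi headers; please double-check them against
your tree rather than relying on this snippet.

  #include <sched.h>            /* SCHED_FIFO */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Local mirror of the uapi struct sched_attr, see linux/sched/types.h */
  struct sched_attr {
          uint32_t size;
          uint32_t sched_policy;
          uint64_t sched_flags;
          int32_t  sched_nice;
          uint32_t sched_priority;      /* RT priority for SCHED_FIFO/RR */
          uint64_t sched_runtime;       /* SCHED_DEADLINE only */
          uint64_t sched_deadline;
          uint64_t sched_period;
          uint32_t sched_util_min;      /* uclamp_min: 0..1024 */
          uint32_t sched_util_max;      /* uclamp_max: 0..1024 */
  };

  #define SCHED_FLAG_UTIL_CLAMP_MIN     0x20    /* from uapi linux/sched.h */

  int main(void)
  {
          struct sched_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size           = sizeof(attr);
          attr.sched_policy   = SCHED_FIFO;
          attr.sched_priority = 10;
          attr.sched_flags    = SCHED_FLAG_UTIL_CLAMP_MIN;
          attr.sched_util_min = 512;    /* ask for at least half the max capacity */

          /* glibc has no sched_setattr() wrapper; use the raw syscall */
          if (syscall(SYS_sched_setattr, 0, &attr, 0))
                  perror("sched_setattr");

          return 0;
  }
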
>
> Note that by default uclamp.min is 1024, which means that RT tasks will
> always be biased towards the big CPUs, which makes for better, more
> predictable behavior in the default case.
>
> I must stress that the bias acts as a hint rather than a definite
> placement strategy. For example, if all big cores are busy executing
> other RT tasks, we can't guarantee that a new RT task will be placed
> there.
>
> On non-heterogeneous systems the original RT behavior should be
> retained. The same applies if uclamp is not selected in the config.
>
> Signed-off-by: Qais Yousef <qais.yousef@xxxxxxx>
> ---
>
> Changes in v2:
> - Use cpupri_find() to check the fitness of the task instead of
> sprinkling find_lowest_rq() with several checks of
> rt_task_fits_capacity().
>
> The selected implementation opted to pass the fitness function as an
> argument rather than call rt_task_fits_capacity() directly, which is
> cleaner and keeps the logical separation of the two modules; but it
> means the compiler has less room to optimize rt_task_fits_capacity()
> out when it's a constant value.
>
> The logic is not perfect. For example, if a 'small' task is occupying a big CPU
> and another big task wakes up, we won't force-migrate the small task to clear
> the big CPU for the big task that woke up.
>
> IOW, the logic is best effort and can't give hard guarantees. But it improves
> the current situation where a task can randomly end up on any CPU regardless of
> what it needs. I.e.: without this patch an RT task can wake up on a big or
> small CPU, but with this it will always wake up on a big CPU (assuming the big
> CPUs aren't overloaded) - hence providing consistent performance.
>
> I'm looking at ways to improve this best effort, but this patch should be
> a good start to discuss our Capacity Awareness requirement. There's a trade-off
> of complexity to be made here and I'd like to keep things as simple as
> possible and build on top as needed.
>
>
> kernel/sched/cpupri.c | 23 ++++++++++--
> kernel/sched/cpupri.h | 4 ++-
> kernel/sched/rt.c | 81 +++++++++++++++++++++++++++++++++++--------
> 3 files changed, 91 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index b7abca987d94..799791c01d60 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -57,7 +57,8 @@ static int convert_prio(int prio)
> * Return: (int)bool - CPUs were found
> */
> int cpupri_find(struct cpupri *cp, struct task_struct *p,
> - struct cpumask *lowest_mask)
> + struct cpumask *lowest_mask,
> + bool (*fitness_fn)(struct task_struct *p, int cpu))
> {
> int idx = 0;
> int task_pri = convert_prio(p->prio);
> @@ -98,6 +99,8 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> continue;
>
> if (lowest_mask) {
> + int cpu;
> +
> cpumask_and(lowest_mask, p->cpus_ptr, vec->mask);
>
> /*
> @@ -108,7 +111,23 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
> * condition, simply act as though we never hit this
> * priority level and continue on.
> */
> - if (cpumask_any(lowest_mask) >= nr_cpu_ids)
> + if (cpumask_empty(lowest_mask))
> + continue;
> +
> + if (!fitness_fn)
> + return 1;
> +
> + /* Ensure the capacity of the CPUs fits the task */
> + for_each_cpu(cpu, lowest_mask) {
> + if (!fitness_fn(p, cpu))
> + cpumask_clear_cpu(cpu, lowest_mask);
> + }
> +
> + /*
> + * If no CPU at the current priority can fit the task
> + * continue looking
> + */
> + if (cpumask_empty(lowest_mask))
> continue;
> }
>
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index 7dc20a3232e7..32dd520db11f 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -18,7 +18,9 @@ struct cpupri {
> };
>
> #ifdef CONFIG_SMP
> -int cpupri_find(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask);
> +int cpupri_find(struct cpupri *cp, struct task_struct *p,
> + struct cpumask *lowest_mask,
> + bool (*fitness_fn)(struct task_struct *p, int cpu));
> void cpupri_set(struct cpupri *cp, int cpu, int pri);
> int cpupri_init(struct cpupri *cp);
> void cpupri_cleanup(struct cpupri *cp);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ebaa4e619684..3a68054e15b3 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -437,6 +437,45 @@ static inline int on_rt_rq(struct sched_rt_entity *rt_se)
> return rt_se->on_rq;
> }
>
> +#ifdef CONFIG_UCLAMP_TASK
> +/*
> + * Verify the fitness of task @p to run on @cpu taking into account the uclamp
> + * settings.
> + *
> + * This check is only important for heterogeneous systems where the uclamp_min
> + * value is higher than the capacity of the @cpu. For non-heterogeneous systems
> + * this function will always return true.
> + *
> + * The function will return true if the capacity of the @cpu is >= the
> + * uclamp_min and false otherwise.
> + *
> + * Note that uclamp_min will be clamped to uclamp_max if uclamp_min
> + * > uclamp_max.
> + */
> +inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + unsigned int min_cap;
> + unsigned int max_cap;
> + unsigned int cpu_cap;
> +
> + /* Only heterogeneous systems can benefit from this check */
> + if (!static_branch_unlikely(&sched_asym_cpucapacity))
> + return true;
> +
> + min_cap = uclamp_eff_value(p, UCLAMP_MIN);
> + max_cap = uclamp_eff_value(p, UCLAMP_MAX);
> +
> + cpu_cap = capacity_orig_of(cpu);
> +
> + return cpu_cap >= min(min_cap, max_cap);
> +}
> +#else
> +static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + return true;
> +}
> +#endif
> +
> #ifdef CONFIG_RT_GROUP_SCHED
>
> static inline u64 sched_rt_runtime(struct rt_rq *rt_rq)
> @@ -1391,6 +1430,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> {
> struct task_struct *curr;
> struct rq *rq;
> + bool test;
>
> /* For anything but wake ups, just return the task_cpu */
> if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
> @@ -1422,10 +1462,16 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
> *
> * This test is optimistic, if we get it wrong the load-balancer
> * will have to sort it out.
> + *
> + * We take into account the capacity of the cpu to ensure it fits the
> + * requirement of the task - which is only important on heterogeneous
> + * systems like big.LITTLE.
> */
> - if (curr && unlikely(rt_task(curr)) &&
> - (curr->nr_cpus_allowed < 2 ||
> - curr->prio <= p->prio)) {
> + test = curr &&
> + unlikely(rt_task(curr)) &&
> + (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> +
> + if (test || !rt_task_fits_capacity(p, cpu)) {
> int target = find_lowest_rq(p);
>
> /*
> @@ -1449,7 +1495,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
> * let's hope p can move out.
> */
> if (rq->curr->nr_cpus_allowed == 1 ||
> - !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
> + !cpupri_find(&rq->rd->cpupri, rq->curr, NULL, NULL))
> return;
>
> /*
> @@ -1457,7 +1503,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
> * see if it is pushed or pulled somewhere else.
> */
> if (p->nr_cpus_allowed != 1
> - && cpupri_find(&rq->rd->cpupri, p, NULL))
> + && cpupri_find(&rq->rd->cpupri, p, NULL, NULL))
> return;
>
> /*
> @@ -1600,7 +1646,8 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct rq_fla
> static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
> {
> if (!task_running(rq, p) &&
> - cpumask_test_cpu(cpu, p->cpus_ptr))
> + cpumask_test_cpu(cpu, p->cpus_ptr) &&
> + rt_task_fits_capacity(p, cpu))
> return 1;
>
> return 0;
> @@ -1642,7 +1689,8 @@ static int find_lowest_rq(struct task_struct *task)
> if (task->nr_cpus_allowed == 1)
> return -1; /* No other targets possible */
>
> - if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask))
> + if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask,
> + rt_task_fits_capacity))
> return -1; /* No targets found */
>
> /*
> @@ -2146,12 +2194,14 @@ static void pull_rt_task(struct rq *this_rq)
> */
> static void task_woken_rt(struct rq *rq, struct task_struct *p)
> {
> - if (!task_running(rq, p) &&
> - !test_tsk_need_resched(rq->curr) &&
> - p->nr_cpus_allowed > 1 &&
> - (dl_task(rq->curr) || rt_task(rq->curr)) &&
> - (rq->curr->nr_cpus_allowed < 2 ||
> - rq->curr->prio <= p->prio))
> + bool need_to_push = !task_running(rq, p) &&
> + !test_tsk_need_resched(rq->curr) &&
> + p->nr_cpus_allowed > 1 &&
> + (dl_task(rq->curr) || rt_task(rq->curr)) &&
> + (rq->curr->nr_cpus_allowed < 2 ||
> + rq->curr->prio <= p->prio);
> +
> + if (need_to_push || !rt_task_fits_capacity(p, cpu_of(rq)))
> push_rt_tasks(rq);
> }
>
> @@ -2223,7 +2273,10 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
> */
> if (task_on_rq_queued(p) && rq->curr != p) {
> #ifdef CONFIG_SMP
> - if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> + bool need_to_push = rq->rt.overloaded ||
> + !rt_task_fits_capacity(p, cpu_of(rq));
> +
> + if (p->nr_cpus_allowed > 1 && need_to_push)
> rt_queue_push_tasks(rq);
> #endif /* CONFIG_SMP */
> if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
> --
> 2.17.1
>