Re: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS

From: Pierre Gondois
Date: Fri Sep 13 2024 - 05:10:48 EST


Hello Vincent,

On 9/12/24 14:30, Vincent Guittot wrote:
Hello Pierre,

On Wed, 11 Sept 2024 at 16:03, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:

Hello Vincent,

On 8/30/24 15:03, Vincent Guittot wrote:
EAS is based on wakeup events to efficiently place tasks on the system, but
there are cases where a task will not have wakeup events anymore or at a
far too low pace. For such situation, we can take advantage of the task
being put back in the enqueued list to check if it should be migrated on
another CPU. When the task is the only one running on the CPU, the tick
will check it the task is stuck on this CPU and should migrate on another
one.

Wake up events remain the main way to migrate tasks but we now detect
situation where a task is stuck on a CPU by checking that its utilization
is larger than the max available compute capacity (max cpu capacity or
uclamp max setting)

Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---
kernel/sched/fair.c | 211 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 2 +
2 files changed, 213 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.cHel
index e46af2416159..41fb18ac118b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

[snip]

+
+static inline void check_misfit_cpu(struct task_struct *p, struct rq *rq)
+{
+ int new_cpu, cpu = cpu_of(rq);
+
+ if (!sched_energy_enabled())
+ return;
+
+ if (WARN_ON(!p))
+ return;
+
+ if (WARN_ON(p != rq->curr))
+ return;
+
+ if (is_migration_disabled(p))
+ return;
+
+ if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
+ return;

I tried the code on a Pixel6 with the following setup:
- without the above (rq->nr_running > 1) condition
- without the push task mechanism
i.e. tasks without regular wakeups only have the opportunity to
run feec() via the sched_tick. It seemed sufficient to avoid
the problematic you mentioned:
- having unbalanced UCLAMP_MAX tasks in a pd, e.g. 1 UCLAMP_MAX task
per little CPU, except one little CPU with N UCLAMP_MAX tasks
- downgrading UCLAMP_MAX tasks that could run on smaller CPUs
but have no wakeups and thus don't run feec()

The main problem with your test is that you always call feec() for the
running task so you always have to wake up the migration thread to
migrate the current running thread which is quite inefficient. The
push mechanism only takes a task which is not the current running one
and we don't need to wake up migration thread which is simpler and
more efficient. We check only one task at a time and will not loop on
an unbounded number of tasks after a task switch or a tick


Thus I was wondering it it would not be better to integrate the
EAS to the load balancer instead (not my idea, but don't remember
who suggested that).

My 1st thought was also to use load balance to pull tasks which were
stuck on the wrong CPU (as mentioned in [1]) but this solution is not
scalable as we don't want to test all runnable task on a cpu and it's
not really easy to know which cpu and which tasks should be checked

[1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp

Or otherwise if just running feec() through the sched_tick path
would not be sufficient (i.e. this patch minus the push mechanism).

As mentioned above, the push mechanism is more efficient than active migration.



+
+ if (!task_misfit_cpu(p, cpu))
+ return;
+
+ new_cpu = find_energy_efficient_cpu(p, cpu);
+
+ if (new_cpu == cpu)
+ return;
+
+ /*
+ * ->active_balance synchronizes accesses to
+ * ->active_balance_work. Once set, it's cleared
+ * only after active load balance is finished.
+ */
+ if (!rq->active_balance) {
+ rq->active_balance = 1;
+ rq->push_cpu = new_cpu;
+ } else
+ return;
+
+ raw_spin_rq_unlock(rq);
+ stop_one_cpu_nowait(cpu,
+ active_load_balance_cpu_stop, rq,
+ &rq->active_balance_work);
+ raw_spin_rq_lock(rq);

I didn't hit any error, but isn't it eligible to the following ?
commit f0498d2a54e7 ("sched: Fix stop_one_cpu_nowait() vs hotplug")


I will recheck but being called from the tick, for the local cpu and
with a running thread no being cpu_stopper_thread, should protect us
from the case describe in this commit


+}
+
+static inline int has_pushable_tasks(struct rq *rq)
+{
+ return !plist_head_empty(&rq->cfs.pushable_tasks);
+}
+
+static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
+{
+ struct task_struct *p;
+
+ if (!has_pushable_tasks(rq))
+ return NULL;
+
+ p = plist_first_entry(&rq->cfs.pushable_tasks,
+ struct task_struct, pushable_tasks);
+
+ WARN_ON_ONCE(rq->cpu != task_cpu(p));
+ WARN_ON_ONCE(task_current(rq, p));
+ WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+
+ WARN_ON_ONCE(!task_on_rq_queued(p));
+
+ /*
+ * Remove task from the pushable list as we try only once after
+ * task has been put back in enqueued list.
+ */
+ plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+ return p;
+}
+
+/*
+ * See if the non running fair tasks on this rq
+ * can be sent to some other CPU that fits better with
+ * their profile.
+ */
+static int push_fair_task(struct rq *rq)
+{
+ struct task_struct *next_task;
+ struct rq *new_rq;
+ int prev_cpu, new_cpu;
+ int ret = 0;
+
+ next_task = pick_next_pushable_fair_task(rq);
+ if (!next_task)
+ return 0;
+
+ if (is_migration_disabled(next_task))
+ return 0;
+
+ if (WARN_ON(next_task == rq->curr))
+ return 0;
+
+ /* We might release rq lock */
+ get_task_struct(next_task);
+
+ prev_cpu = rq->cpu;
+
+ new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
+
+ if (new_cpu == prev_cpu)
+ goto out;
+
+ new_rq = cpu_rq(new_cpu);
+
+ if (double_lock_balance(rq, new_rq)) {


I think it might be necessary to check the following:
if (task_cpu(next_task) != rq->cpu) {
double_unlock_balance(rq, new_rq);

Yes good point

goto out;
}

Indeed I've been hitting the following warnings:
- uclamp_rq_dec_id():SCHED_WARN_ON(!bucket->tasks)
- set_task_cpu()::WARN_ON_ONCE(state == TASK_RUNNING &&
p->sched_class == &fair_sched_class &&
(p->on_rq && !task_on_rq_migrating(p)))
- update_entity_lag()::SCHED_WARN_ON(!se->on_rq)

and it seemed to be caused by the task not being on the initial rq anymore.

Do you have a particular use case to trigger this ? I haven't faced
this in the various stress tests that I did

It was triggered by this workload, but it was on a Pixel6 device,
so there might be more background activity:
- 8 tasks with: [UCLAMP_MIN:0, UCLAMP_MAX:1, duty_cycle:100%, period:16ms]
- 4 tasks each bound to a little CPU with: [UCLAMP_MIN:0, UCLAMP_MAX:1024, duty_cycle:4%, period:4ms]

Regards,
Pierre