[RFC][PATCH 11/14] sched/fair: Synchronous PELT detach on load-balance migrate

From: Peter Zijlstra
Date: Fri May 12 2017 - 13:21:25 EST


Vincent wondered why his self migrating task had a roughly 50% dip in
load_avg when landing on the new CPU. This is because we unconditionally
take the asynchronous detach_entity route, which can lead to the
attach on the new CPU still seeing the old CPU's contribution to
tg->load_avg, effectively halving the new CPU's shares.

While in general this is something we have to live with, there is the
special case of runnable migration where we can do better.

Tested-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
kernel/sched/fair.c | 33 +++++++++++++++++++++------------
1 file changed, 21 insertions(+), 12 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3649,10 +3649,6 @@ void remove_entity_load_avg(struct sched
* Similarly for groups, they will have passed through
* post_init_entity_util_avg() before unregister_sched_fair_group()
* calls this.
- *
- * XXX in case entity_is_task(se) && task_of(se)->on_rq == MIGRATING
- * we could actually get the right time, since we're called with
- * rq->lock held, see detach_task().
*/

sync_entity_load_avg(se);
@@ -6251,6 +6247,8 @@ select_task_rq_fair(struct task_struct *
return new_cpu;
}

+static void detach_entity_cfs_rq(struct sched_entity *se);
+
/*
* Called immediately before a task is migrated to a new cpu; task_cpu(p) and
* cfs_rq_of(p) references at time of call are still valid and identify the
@@ -6284,14 +6282,25 @@ static void migrate_task_rq_fair(struct
se->vruntime -= min_vruntime;
}

- /*
- * We are supposed to update the task to "current" time, then its up to date
- * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
- * what current time is, so simply throw away the out-of-date time. This
- * will result in the wakee task is less decayed, but giving the wakee more
- * load sounds not bad.
- */
- remove_entity_load_avg(&p->se);
+ if (p->on_rq == TASK_ON_RQ_MIGRATING) {
+ /*
+ * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
+ * rq->lock and can modify state directly.
+ */
+ lockdep_assert_held(&task_rq(p)->lock);
+ detach_entity_cfs_rq(&p->se);
+
+ } else {
+ /*
+ * We are supposed to update the task to "current" time, then
+ * its up to date and ready to go to new CPU/cfs_rq. But we
+ * have difficulty in getting what current time is, so simply
+ * throw away the out-of-date time. This will result in the
+ * wakee task is less decayed, but giving the wakee more load
+ * sounds not bad.
+ */
+ remove_entity_load_avg(&p->se);
+ }

/* Tell new CPU we are migrated */
p->se.avg.last_update_time = 0;