Re: [PATCH v4 2/7] sched/fair: Decay task PELT values during wakeup migration
From: Vincent Donnefort
Date: Wed Apr 20 2022 - 06:27:50 EST
On 20/04/2022 10:34, Vincent Donnefort wrote:
On 19/04/2022 17:27, Vincent Guittot wrote:
On Tuesday 19 April 2022 at 13:23:27 (+0100), Vincent Donnefort wrote:
On 19/04/2022 11:08, Vincent Guittot wrote:
On Tue, 12 Apr 2022 at 15:42, Vincent Donnefort
<vincent.donnefort@xxxxxxx> wrote:
Before being migrated to a new CPU, a task sees its PELT values
synchronized with the rq last_update_time. Once done, that same task will
also have its sched_avg last_update_time reset. This means the time
between the migration and the last clock update (B) will not be accounted
for in util_avg and a discontinuity will appear. This issue is amplified
by the PELT clock scaling. If the clock hasn't been updated while the CPU
is idle, clock_pelt will not be aligned with clock_task and that time (A)
will also be lost.
---------|----- A -----|-----------|------- B -----|>
    clock_pelt    clock_task     clock            now
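As a made-up illustration of those two components: with A = 2ms of
un-synced pelt time and B = 50ms of idle time without any clock update, a
task migrating at "now" misses roughly 52ms of decay and shows up on the
new CPU with an inflated util_avg.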
This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.
Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock
which is quite expensive [1]. Instead, estimate the missing time and
update the task util_avg with that value:
A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
Neither clock_task, clock_pelt nor clock can be accessed without the
runqueue lock. The new cfs_rq last_update_lag is therefore created and
contains those three values when the last_update_time value for that very
same cfs_rq is updated.
last_update_lag = clock - clock_task + clock_pelt
And we can then write the missing time as follows:
A + B = sched_clock_cpu() - last_update_lag
The B part of the missing time is however an estimation that doesn't take
IRQ and Paravirt time into account.
Now that we have an estimation for A + B, we can create an estimator for
the PELT value at the time of the migration. For this purpose we need to
inject last_update_time, which is a combination of both clock_pelt and
lost_idle_time. The latter is a time value which is completely lost from a
PELT point of view and must be ignored. And finally, we can write:
now = last_update_time + A + B
    = last_update_time + sched_clock_cpu() - last_update_lag
This estimation has a cost, mostly due to sched_clock_cpu(). Limit the
usage to the case where the source CPU is idle, as we know this is when
the clock has the biggest risk of being outdated.
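To illustrate, the estimation described above boils down to something
along these lines (a simplified sketch of the idea only, not the actual
patch code; the function name is made up and the idle-only restriction
just mentioned, as well as 32-bit access details, are ignored):

/* Sketch: estimate the PELT clock at migration time from A + B. */
static void estimate_se_pelt_lag(struct sched_entity *se)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(se);
	u64 lag, now;

	/* last_update_lag = clock - clock_task + clock_pelt */
	lag = READ_ONCE(cfs_rq->last_update_lag);

	/* now = last_update_time + A + B */
	now = cfs_rq->avg.last_update_time +
	      sched_clock_cpu(cpu_of(rq_of(cfs_rq))) - lag;

	/* Decay the blocked entity up to the estimated "now" */
	__update_load_avg_blocked_se(now, se);
}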
[1]
https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@xxxxxxx/
Signed-off-by: Vincent Donnefort <vincent.donnefort@xxxxxxx>
[...]
I'm worried that we will call this for each and every
update_cfs_rq_load_avg() whereas the content will be used only when
idle and not throttled. Can't we use these conditions to save values
only when needed and limit the number of useless updates?
I don't think we can use idle here as a condition: once the CPU is idle,
it is too late to get those clock values.
As an example, the patch below should work. It doesn't handle the
throttled case yet and still has to make sure that rq->enter_idle and
rq->clock_pelt_idle stay coherent with regard to the ILB updating blocked
load.
I had to abandon the per-rq approach between v1 and v2. This is because of
the following example:
1. task A sleeps
2. rq's clock updated (e.g. another task runs)
3. task A migrated
With a per-rq lag, we would miss the time delta between 1 and 2. We know
how old the last clock update is. But what we actually want is how old the
task's last_update_time is.
However, to go back to your idea of only updating when going idle: we
could do the update only when the task is dequeued. We know that a
wake-up migration can only happen after that dequeue.
Something like:
static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
	...
	update_load_avg(cfs_rq, se, UPDATE_TG);
+	if (entity_is_task(se))
+		update_cfs_rq_lag(cfs_rq);
	se_update_runnable(se);
That would drastically reduce the number of times update_cfs_rq_lag()
would be called, skipping the update during enqueue and tick.
Additionally, we could also check nr_running in the cfs_rq, to only do the
update when the cfs_rq becomes 'idle'?
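Roughly something like this (rough, untested sketch; at this point in
dequeue_entity() nr_running still accounts for the entity being dequeued,
hence the == 1 test):

	update_load_avg(cfs_rq, se, UPDATE_TG);
+	/* cfs_rq is about to become idle: record the lag now */
+	if (entity_is_task(se) && cfs_rq->nr_running == 1)
+		update_cfs_rq_lag(cfs_rq);
	se_update_runnable(se);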
Thoughts?
---
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
kernel/sched/pelt.h | 21 ++++++++++++++-------
kernel/sched/sched.h | 3 ++-
3 files changed, 46 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e6ecf530f19f..f00843f9dd01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7005,6 +7005,35 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
static void detach_entity_cfs_rq(struct sched_entity *se);
+#ifdef CONFIG_NO_HZ_COMMON
+static inline void migrate_se_pelt_lag(struct sched_entity *se)
+{
+ u64 now;
+ struct cfs_rq *cfs_rq;
+ struct rq *rq;
+ bool is_idle;
+
+ cfs_rq = cfs_rq_of(se);
+ rq = rq_of(cfs_rq);
+
+ rcu_read_lock();
+ is_idle = is_idle_task(rcu_dereference(rq->curr));
+ rcu_read_unlock();
+
+ if (!is_idle)
+ return;
+
+ /* TODO handle throttled cfs */
+ /* TODO handle ILB blocked load update */
+ now = READ_ONCE(rq->clock_pelt_idle);
+ now += sched_clock_cpu(cpu_of(rq)) - READ_ONCE(rq->enter_idle);
+
+ __update_load_avg_blocked_se(now, se);
+}
+#else
+static void migrate_se_pelt_lag(struct sched_entity *se) {}
+#endif
+
/*
 * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
 * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -7056,6 +7085,7 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
* sounds not bad.
*/
remove_entity_load_avg(&p->se);
+ migrate_se_pelt_lag(&p->se);
}
/* Tell new CPU we are migrated */
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index c336f5f481bc..ece4423026e5 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -61,6 +61,14 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
WRITE_ONCE(avg->util_est.enqueued, enqueued);
}
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+ assert_clock_updated(rq);
+
+ return rq->clock_pelt - rq->lost_idle_time;
+}
+
/*
* The clock_pelt scales the time to reflect the effective amount of
* computation done during the running delta time but then sync back to
@@ -78,6 +86,8 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
if (unlikely(is_idle_task(rq->curr))) {
/* The rq is idle, we can sync to clock_task */
rq->clock_pelt = rq_clock_task(rq);
+ WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be factorized with idle_stamp */
+ WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt clock update when idle */
return;
}
@@ -130,14 +140,11 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
*/
if (util_sum >= divider)
rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
-}
-static inline u64 rq_clock_pelt(struct rq *rq)
-{
- lockdep_assert_rq_held(rq);
- assert_clock_updated(rq);
-
- return rq->clock_pelt - rq->lost_idle_time;
+ /* The rq is idle, we can sync with clock_task */
+ rq->clock_pelt = rq_clock_task(rq);
+ WRITE_ONCE(rq->enter_idle, rq_clock(rq)); /* this could be factorized with idle_stamp */
+ WRITE_ONCE(rq->clock_pelt_idle, rq_clock_pelt(rq)); /* last pelt clock update when idle */
}
#ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6ab77b171656..108698446762 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1007,7 +1007,8 @@ struct rq {
u64 clock_task ____cacheline_aligned;
u64 clock_pelt;
unsigned long lost_idle_time;
-
+ u64 clock_pelt_idle;
+ u64 enter_idle;
atomic_t nr_iowait;
#ifdef CONFIG_SCHED_DEBUG