Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()

From: Mathieu Desnoyers
Date: Mon Sep 11 2023 - 16:49:39 EST


On 9/11/23 11:26, Mathieu Desnoyers wrote:
On 9/10/23 22:50, Chen Yu wrote:
[...]
---
  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
  kernel/sched/features.h |  1 +
  kernel/sched/sched.h    |  1 +
  3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e20f50726ab8..fe3b760c9654 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
      hrtick_update(rq);
      now = sched_clock_cpu(cpu_of(rq));
      p->se.prev_sleep_time = task_sleep ? now : 0;
+#ifdef CONFIG_SMP
+    /*
+     * If this rq will become idle, and dequeued task is
+     * a short sleeping one, check if we can reserve
+     * this idle CPU for that task for a short while.
+     * During this reservation period, other wakees will
+     * skip this 'idle' CPU in select_idle_cpu(), and this
+     * short sleeping task can pick its previous CPU in
+     * select_idle_sibling(), which brings better cache
+     * locality.
+     */
+    if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
+        p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
+        rq->cache_hot_timeout = now + p->se.sleep_avg;

This is really cool!

There is one scenario that worries me with this approach: workloads
that sleep for a long time and then have short blocked periods.
The long sleeps will likely pull the average to values too high
to stay below sysctl_sched_migration_cost, so the short blocked
periods never trigger the reservation.

I wonder if changing the code above to the following would help?

if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.sleep_avg)
    rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, p->se.sleep_avg);

For tasks that have a large sleep_avg, it would mark this rq as
"not idle for rq selection" for a window of at most
sysctl_sched_migration_cost. If the sleep ends up being a long one,
preventing other tasks from being migrated to this rq for such a tiny
window should not matter performance-wise. I would expect that it
could help workloads that come in bursts.

There is perhaps a better way to handle bursts:

When calculating the sleep_avg, we only really care about the sleep
time of short bursts, so we could use sysctl_sched_migration_cost
to decide which sleeps are taken into account in the average.

We could rename the field "sleep_avg" to "burst_sleep_avg", and have:

u64 now = sched_clock_cpu(task_cpu(p));

if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
    now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
        update_avg(&p->se.burst_sleep_avg, now - last_sleep);

Then we can have this code in dequeue_task_fair():

if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg)
        rq->cache_hot_timeout = now + p->se.burst_sleep_avg;

Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com