On 9/10/23 22:50, Chen Yu wrote:[...]
---
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e20f50726ab8..fe3b760c9654 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
now = sched_clock_cpu(cpu_of(rq));
p->se.prev_sleep_time = task_sleep ? now : 0;
+#ifdef CONFIG_SMP
+ /*
+ * If this rq will become idle, and dequeued task is
+ * a short sleeping one, check if we can reserve
+ * this idle CPU for that task for a short while.
+ * During this reservation period, other wakees will
+ * skip this 'idle' CPU in select_idle_cpu(), and this
+ * short sleeping task can pick its previous CPU in
+ * select_idle_sibling(), which brings better cache
+ * locality.
+ */
+ if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
+ p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
+ rq->cache_hot_timeout = now + p->se.sleep_avg;

This is really cool!

There is one scenario that worries me with this approach: workloads
that sleep for a long time and then have bursts of short blocked periods.
The long sleeps will likely push the average too high
to stay below sysctl_sched_migration_cost.

I wonder if changing the code above to the following would help?

	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.sleep_avg)
		rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, p->se.sleep_avg);

For tasks that have a large sleep_avg, this would make the rq
"appear as not idle for rq selection" for a window of
sysctl_sched_migration_cost. If the sleep ends up being a long one,
preventing other tasks from being migrated to this rq for a tiny
window should not matter performance-wise. I would expect that it
could help workloads that come in bursts.
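
To make the difference concrete, here is a minimal user-space sketch (not
kernel code) comparing the two conditions. The helper names
timeout_original() and timeout_clamped() and the MIGRATION_COST_NS constant
are hypothetical and only for illustration; the constant assumes the usual
500000 ns default of sysctl_sched_migration_cost, and the other conditions
(sched_feat(SIS_CACHE), task_sleep, !rq->nr_running) are assumed to hold.

```c
#include <stdint.h>

/* Assumption: stands in for sysctl_sched_migration_cost (default 500000 ns). */
#define MIGRATION_COST_NS 500000ULL

/*
 * Condition as in the patch: reserve the CPU only when sleep_avg is
 * non-zero and below the migration cost. Otherwise the previous
 * timeout is left untouched.
 */
static uint64_t timeout_original(uint64_t now, uint64_t sleep_avg,
				 uint64_t old_timeout)
{
	if (sleep_avg && sleep_avg < MIGRATION_COST_NS)
		return now + sleep_avg;
	return old_timeout;
}

/*
 * Suggested variant: reserve whenever sleep_avg is non-zero, but clamp
 * the reservation window to the migration cost.
 */
static uint64_t timeout_clamped(uint64_t now, uint64_t sleep_avg,
				uint64_t old_timeout)
{
	uint64_t window;

	if (!sleep_avg)
		return old_timeout;

	window = sleep_avg < MIGRATION_COST_NS ? sleep_avg : MIGRATION_COST_NS;
	return now + window;
}
```

With a sleep_avg of 200000 ns both variants reserve the CPU for 200000 ns.
With a sleep_avg of 3000000 ns the original condition makes no reservation,
while the clamped variant reserves the CPU for 500000 ns.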