Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.

From: Subhra Mazumdar
Date: Wed Apr 10 2019 - 20:16:01 EST



On 4/9/19 11:38 AM, Julien Desfossez wrote:
We found the source of the major performance regression we discussed
previously. It turns out there was a pattern where a task (a kworker in this
case) could be woken up, but the core could still end up idle before that
task had a chance to run.

Example sequence: cpu0 and cpu1 are siblings on the same core; task1 and
task2 are in the same cgroup with the tag enabled (each line below
happens in increasing time order):
- task1 running on cpu0, task2 running on cpu1
- sched_waking(kworker/0, target_cpu=cpu0)
- task1 scheduled out of cpu0
- kworker/0 cannot run on cpu0 because task2 is still running on cpu1, so
  cpu0 is idle
- task2 scheduled out of cpu1
- cpu1 doesn't select kworker/0 for cpu0, because the optimization path
  ends task selection early when core_cookie is NULL for both the
  currently selected task and cpu1's runqueue (see the sketch after this
  sequence)
- cpu1 is idle
--> both siblings are idle but kworker/0 is still in cpu0's runqueue.
cpu0 may stay idle even longer if it enters a deep idle state.
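To make the failure mode concrete, here is a minimal user-space sketch of
the decision described above. The names (fake_rq, curr_is_idle,
fast_path_taken) are simplified stand-ins for the kernel's rq state, not
the actual pick_next_task() code:

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for a per-CPU runqueue; not the kernel's struct rq. */
struct fake_rq {
	unsigned long core_cookie;	/* cookie of the core-wide selection */
	int nr_running;			/* runnable tasks queued on this CPU */
	bool curr_is_idle;		/* this CPU is running its idle task */
};

/*
 * Model of the "unconstrained pick" fast path: when the task picked on
 * the current CPU has no cookie and no core-wide cookie is set, selection
 * stops early and the sibling is never asked to re-pick.
 */
static bool fast_path_taken(unsigned long picked_cookie, unsigned long core_cookie)
{
	return picked_cookie == 0 && core_cookie == 0;
}

int main(void)
{
	/*
	 * State right after task2 schedules out of cpu1 (last step above):
	 * cpu0 is idle but still has kworker/0 queued.
	 */
	struct fake_rq cpu0 = { .core_cookie = 0, .nr_running = 1, .curr_is_idle = true };
	unsigned long picked_on_cpu1 = 0;	/* cpu1 picks idle, cookie 0 */
	unsigned long core_cookie = 0;		/* no tagged task selected core-wide */

	if (fast_path_taken(picked_on_cpu1, core_cookie) &&
	    cpu0.curr_is_idle && cpu0.nr_running)
		printf("bug: cpu0 stays idle with %d task(s) queued\n", cpu0.nr_running);

	return 0;
}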

With the fix below, we make sure to send an IPI to the sibling if it is
idle but has tasks waiting in its runqueue.
This fixes the performance issue we were seeing.

Now here is what we can measure with a disk write-intensive benchmark:
- no performance impact with enabling core scheduling without any tagged
task,
- 5% overhead if one tagged task is competing with an untagged task,
- 10% overhead if two tasks tagged with different tags are competing
  against each other.

We are starting more scaling tests, but this is very encouraging!


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1fa10561279..02c862a5e973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
trace_printk("unconstrained pick: %s/%d %lx\n",
next->comm, next->pid, next->core_cookie);
+ rq->core_pick = NULL;
+ /*
+ * If the sibling is idling, we might want to wake it
+ * so that it can check for any runnable but blocked tasks
+ * due to previous task matching.
+ */
+ for_each_cpu(j, smt_mask) {
+ struct rq *rq_j = cpu_rq(j);
+ rq_j->core_pick = NULL;
+ if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running) {
+ resched_curr(rq_j);
+ trace_printk("IPI(%d->%d[%d]) idle preempt\n",
+ cpu, j, rq_j->nr_running);
+ }
+ }
goto done;
}
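The condition in the loop above is exactly what flags the leaked-task
state from the earlier sketch; in the same toy model (reusing the
hypothetical fake_rq from before, not the kernel code) it reads:

/*
 * Continuation of the earlier toy model: the check the patch adds.
 * A sibling that is running its idle task but still has queued work
 * gets a resched IPI so it re-runs task selection.
 */
static bool sibling_needs_kick(const struct fake_rq *sib)
{
	/* mirrors: is_idle_task(rq_j->curr) && rq_j->nr_running */
	return sib->curr_is_idle && sib->nr_running > 0;
}

In the scenario above, sibling_needs_kick(&cpu0) returns true, so cpu1
sends the resched IPI instead of leaving kworker/0 stranded.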
I see a similar improvement with this patch as with removing the condition I
mentioned earlier, so that change isn't needed. I also included the patch for
the priority fix. For 2 DB instances, disabling HT stands at -22% for 32 users
(from earlier emails).


1 DB instance

users   baseline   %idle   core_sched   %idle
16      1          84      -4.9%        84
24      1          76      -6.7%        75
32      1          69      -2.4%        69

2 DB instances

users   baseline   %idle   core_sched   %idle
16      1          66      -19.5%       69
24      1          54      -9.8%        57
32      1          42      -27.2%       48