Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability

From: Peter Zijlstra
Date: Wed Apr 25 2018 - 14:02:08 EST


On Wed, Apr 25, 2018 at 05:36:00PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 05:10:34PM -0700, Subhra Mazumdar wrote:
> > On 04/24/2018 05:53 AM, Peter Zijlstra wrote:
>
> > > Why do you need to put a max on? Why isn't the proportional thing
> > > working as is? (is the average no good because of big variance or what)
>
> > Firstly, the choice of 512 seems arbitrary.
>
> It is; it is a crude attempt to deal with big variance. The comment says
> as much.
>
> > Secondly, the logic here is that the enqueuing cpu should search up to
> > the time it can get work itself. Why is that the optimal amount to
> > search?
>
> 1/512-th of the time in fact, per the above random number, but yes.
> Because searching for longer than we expect to be idle is clearly bad;
> at that point we're inhibiting useful work.
>
> But while thinking about all this, I think I've spotted a few more
> issues, aside from the variance:
>
> Firstly, while avg_idle estimates the average duration for _when_ we go
> idle, it doesn't give a good measure when we do not in fact go idle. So
> when we get wakeups while fully busy, avg_idle is a poor measure.
>
> Secondly, the number of wakeups performed is also important. If we have
> a lot of wakeups, we need to look at aggregate wakeup time over a
> period. Not just single wakeup time.
>
> And thirdly, we're sharing the idle duration with newidle balance.
>
> And I think the 512 is a result of me not having recognised these
> additional issues when looking at the traces; I saw variance and left it
> there.
>
>
> This leaves me thinking we need a better estimator for wakeups. Because
> if there really is significant idle time, not looking for idle CPUs to
> run on is bad. Placing that upper limit, especially such a low one, is
> just an indication of failure.
>
>
> I'll see if I can come up with something.
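
For reference, the proportional limit being discussed above is the
existing SIS_PROP sizing in select_idle_cpu(); roughly this (quoted from
memory of the code around that time, so details may differ per version):

	if (sched_feat(SIS_PROP)) {
		u64 span_avg = sd->span_weight * avg_idle;
		if (span_avg > 4*avg_cost)
			nr = div_u64(span_avg, avg_cost);
		else
			nr = 4;
	}

With avg_idle being rq->avg_idle / 512, 'nr' is sized such that the scan
is expected to cost about rq->avg_idle / 512 worth of time, which is the
1/512-th mentioned above.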

Something like the below _could_ work. Again, completely untested. We
feed idle time into wake_avg and subtract the select_idle_sibling()
'runtime' from it, such that when there are lots of wakeups we don't use
more time than was reported idle. And we age wake_avg, such that if
there hasn't been any idle time for a number of ticks (we've been really
busy) we also stop scanning wide.

But it could eat your granny and set your cat on fire :-)
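
To make the intended bookkeeping concrete before the patch itself: below
is a minimal stand-alone sketch of it in plain user-space C (hypothetical
names, a single fake runqueue, jiffies and nanoseconds as bare integers),
not kernel code.

#include <stdio.h>

/* toy versions of the rq fields the patch adds */
static unsigned long wake_stamp;	/* jiffy of the last idle exit */
static unsigned long long wake_avg;	/* remaining "idle budget", in ns */

/* idle exit: stamp the current jiffy and refill the budget from avg_idle */
static void idle_exit(unsigned long now, unsigned long long avg_idle)
{
	wake_stamp = now;
	wake_avg = avg_idle / 2;
}

/* wakeup: age the budget, report how much scanning it allows, then spend it */
static unsigned long long wakeup_scan(unsigned long now, unsigned long long scan_time)
{
	unsigned long long allowed;

	/* age the remaining idle time: halve it once per busy jiffy */
	while (wake_stamp < now && wake_avg) {
		wake_stamp++;
		wake_avg >>= 1;
	}

	allowed = wake_avg;

	/* you can only spend the time once */
	if (wake_avg > scan_time)
		wake_avg -= scan_time;
	else
		wake_avg = 0;

	return allowed;
}

int main(void)
{
	int i;

	idle_exit(100, 2000000);	/* came out of idle at jiffy 100, avg_idle = 2ms */

	/* three quick wakeups in the same jiffy all spend from one budget... */
	for (i = 0; i < 3; i++)
		printf("scan budget: %llu ns\n", wakeup_scan(100, 400000));

	/* ...and five busy jiffies decay whatever would have been left */
	printf("scan budget: %llu ns\n", wakeup_scan(105, 0));

	return 0;
}

So back-to-back wakeups drain one shared budget instead of each getting
the full avg_idle, and a busy stretch without idle time decays it towards
zero. Note that SIS_NEW below defaults to false, so the new path would
need to be flipped on through the sched_features debugfs knob for any
testing.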

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aaeebfcc..bc910e5776cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1671,6 +1671,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 		if (rq->avg_idle > max)
 			rq->avg_idle = max;
 
+		rq->wake_stamp = jiffies;
+		rq->wake_avg = rq->avg_idle / 2;
+
 		rq->idle_stamp = 0;
 	}
 #endif
@@ -6072,6 +6075,8 @@ void __init sched_init(void)
 		rq->online = 0;
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->wake_stamp = jiffies;
+		rq->wake_avg = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e7ab9b..fee31dbe15ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6369,7 +6369,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *this_sd;
+	unsigned long now = jiffies;
 	u64 avg_cost, avg_idle;
+	struct rq *this_rq;
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
@@ -6378,11 +6380,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	avg_idle = this_rq()->avg_idle / 512;
+	this_rq = this_rq();
+	if (sched_feat(SIS_NEW)) {
+		/* age the remaining idle time: halve it once per busy jiffy */
+		while (this_rq->wake_stamp < now && this_rq->wake_avg) {
+			this_rq->wake_stamp++;
+			this_rq->wake_avg >>= 1;
+		}
+		avg_idle = this_rq->wake_avg;
+	} else {
+		avg_idle = this_rq->avg_idle / 512;
+	}
+
 	avg_cost = this_sd->avg_scan_cost + 1;
 
 	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
@@ -6412,6 +6421,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	delta = (s64)(time - cost) / 8;
 	this_sd->avg_scan_cost += delta;
 
+	/* you can only spend the time once */
+	if (this_rq->wake_avg > time)
+		this_rq->wake_avg -= time;
+	else
+		this_rq->wake_avg = 0;
+
 	return cpu;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae8488039c..f5f86a59aac4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_NEW, false)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c222ca2..c4d6ddf907b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -831,6 +831,9 @@ struct rq {
 	u64			idle_stamp;
 	u64			avg_idle;
 
+	unsigned long		wake_stamp;
+	u64			wake_avg;
+
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
 #endif