Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732
From: Mike Galbraith
Date: Sun Apr 10 2016 - 05:35:49 EST
On Sun, 2016-04-10 at 05:44 +0200, Rafael J. Wysocki wrote:
> On Sat, Apr 9, 2016 at 6:39 PM, Mike Galbraith <
> umgwanakikbuti@xxxxxxxxx> wrote:
> >
> > Hm, setting gov=performance, and taking the average of 3 30 second
> > interval PkgWatt samples as pipe-test runs..
> >
> > 714KHz/28.03Ws = 25.46
> > 877KHz/30.28Ws = 28.96
> >
> > ..for pipe-test, the tradeoff looks a bit more like red than green.
>
> Well, fair enough, but that's just pipe-test, and what about the
> people who don't see the performance gain and see the energy loss,
> like Doug?
Perhaps Doug sees increased power because he's not throttling no_hz,
whereas I am, so he burns more power getting _to_ idle? Dunno, maybe
he'll try the attached. If it's a general case energy loser, so be it,
numbers talk, bs walks and all that ;-)
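
For anyone who wants to redo the arithmetic on the pipe-test figures quoted above, here is a minimal sketch. It assumes each quoted ratio is simply pipe-test throughput in KHz divided by the averaged PkgWatt reading (written as "Ws" in the quote). The helper names work_per_watt() and relative_change() are made up for illustration and come from no tool.

```c
#include <assert.h>

/*
 * Throughput per unit of package power: pipe-test KHz divided by the
 * averaged PkgWatt reading. Hypothetical helper, for illustration only.
 */
static double work_per_watt(double khz, double pkg_watts)
{
	assert(pkg_watts > 0.0);
	return khz / pkg_watts;
}

/* Relative change from a to b, e.g. 0.10 for a 10% increase. */
static double relative_change(double a, double b)
{
	assert(a != 0.0);
	return (b - a) / a;
}
```

Feeding in the two quoted samples, relative_change(714, 877) is roughly 0.23 and relative_change(28.03, 30.28) is roughly 0.08, so throughput rises by about 23% while the power reading rises by about 8%.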
> Essentially, this trades performance gains in somewhat special
> workloads for increased energy consumption in idle. Those workloads
> need not be run by everybody, but idle is.
Cross core scheduling is routine business, we do truckloads of that for
good reason, and lots of stuff does wakeups at high frequency.
> That said I applied the patch you're complaining about mostly because
> the commit that introduced the change in question in 4.5 claimed that
> it wouldn't affect idle power on systems with reasonably fast C1, but
> that didn't pass the reality test. I'm not totally against restoring
> that change, but it would need to be based on very solid evidence.
Understood. My box seems to be saying we can hug the trees hardest by
telling the CPU to get work done as quickly as possible, but I don't have
much experience at tree hugging measurement. Performance wise, tasks
talking via localhost are definitely not special.

tbench                    1      2      4      8
base                    752   1283   2250   3362

select_idle_sibling() off
                        735   1344   2080   2884
delta                  .977  1.047   .924   .857

select_idle_sibling() on, 0c313cb20732 reverted
                        816   1317   2240   3388
delta                 1.085  1.026   .995  1.007  vs base
delta                 1.110   .979  1.076  1.174  vs off
(^hm)
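
The delta rows in the tbench table are plain ratios of one throughput row to another. A small sketch for recomputing them follows; the array and function names are invented for illustration, and only the numbers come from the table.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput rows copied from the tbench table, for 1, 2, 4 and 8 clients. */
static const double tbench_base[4]     = {  752, 1283, 2250, 3362 };
static const double tbench_off[4]      = {  735, 1344, 2080, 2884 };
static const double tbench_reverted[4] = {  816, 1317, 2240, 3388 };

/* Ratio of a result to its reference; above 1.0 means the result is faster. */
static double tbench_delta(double result, double reference)
{
	assert(reference > 0.0);
	return result / reference;
}

/* Print one "delta" row for all four client counts. */
static void print_delta_row(const char *label,
			    const double *result, const double *reference)
{
	printf("%-18s", label);
	for (int i = 0; i < 4; i++)
		printf(" %6.3f", tbench_delta(result[i], reference[i]));
	printf("\n");
}
```

Calling print_delta_row("reverted vs off", tbench_reverted, tbench_off) from a main() reproduces the last delta row. The final digit can differ by one from the table, which looks truncated rather than rounded.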
-Mike

sched: ratelimit nohz

Entering nohz code on every micro-idle is too expensive to bear.
Signed-off-by: Mike Galbraith <efault@xxxxxx>
---
include/linux/sched.h | 5 +++++
kernel/sched/core.c | 8 ++++++++
kernel/time/tick-sched.c | 2 +-
3 files changed, 14 insertions(+), 1 deletion(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2286,6 +2286,11 @@ static inline int set_cpus_allowed_ptr(s
#ifdef CONFIG_NO_HZ_COMMON
void calc_load_enter_idle(void);
void calc_load_exit_idle(void);
+#ifdef CONFIG_SMP
+extern int sched_needs_cpu(int cpu);
+#else
+static inline int sched_needs_cpu(int cpu) { return 0; }
+#endif
#else
static inline void calc_load_enter_idle(void) { }
static inline void calc_load_exit_idle(void) { }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -577,6 +577,14 @@ static inline bool got_nohz_idle_kick(vo
return false;
}
+int sched_needs_cpu(int cpu)
+{
+ if (tick_nohz_full_cpu(cpu))
+ return 0;
+
+ return cpu_rq(cpu)->avg_idle < sysctl_sched_migration_cost;
+}
+
#else /* CONFIG_NO_HZ_COMMON */
static inline bool got_nohz_idle_kick(void)
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -676,7 +676,7 @@ static ktime_t tick_nohz_stop_sched_tick
} while (read_seqretry(&jiffies_lock, seq));
ts->last_jiffies = basejiff;
- if (rcu_needs_cpu(basemono, &next_rcu) ||
+ if (sched_needs_cpu(cpu) || rcu_needs_cpu(basemono, &next_rcu) ||
arch_needs_cpu() || irq_work_needs_cpu()) {
next_tick = basemono + TICK_NSEC;
} else {
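
The decision made by sched_needs_cpu() in the patch above can be modelled in userspace as below. This is only a sketch: struct rq_model and model_sched_needs_cpu() are hypothetical stand-ins for the kernel's runqueue and for the patched function, and the migration cost is passed in as a parameter instead of being read from sysctl_sched_migration_cost (whose default was, as far as I know, 500000 ns in kernels of that era).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU state the patch looks at. */
struct rq_model {
	uint64_t avg_idle_ns;	/* models cpu_rq(cpu)->avg_idle */
	bool nohz_full;		/* models tick_nohz_full_cpu(cpu) */
};

/*
 * Mirrors the logic of sched_needs_cpu() from the patch: a CPU whose
 * recent idle periods are shorter than the migration cost keeps its
 * tick, so it does not pay for nohz entry on every micro-idle.
 * nohz_full CPUs are exempt from the rate limit.
 */
static bool model_sched_needs_cpu(const struct rq_model *rq,
				  uint64_t migration_cost_ns)
{
	assert(rq != NULL);

	if (rq->nohz_full)
		return false;

	return rq->avg_idle_ns < migration_cost_ns;
}
```

With a migration cost of 500000 ns, a CPU averaging 20 us of idle reports that it needs the tick, while one averaging 2 ms of idle does not and is free to stop it.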