Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group

From: Dietmar Eggemann
Date: Thu Mar 26 2015 - 11:23:49 EST


On 24/03/15 17:39, Morten Rasmussen wrote:
> On Tue, Mar 24, 2015 at 04:10:37PM +0000, Peter Zijlstra wrote:
>> On Tue, Mar 24, 2015 at 10:44:24AM +0000, Morten Rasmussen wrote:
>>>>> Maybe remind us why this needs to be tied to sched_groups ? Why can't we
>>>>> attach the energy information to the domains?
>>
>>> In the current domain hierarchy you don't have domains with just one cpu
>>> in them. If you attach the per-cpu energy data to the MC level domain
>>> which spans the whole cluster, you break the current idea of attaching
>>> information to the cpumask (currently sched_group, but could be
>>> sched_domain as we discuss here) the information is associated with. You
>>> would have to either introduce a level of single cpu domains at the
>>> lowest level or move away from the idea of attaching data to the cpumask
>>> that is associated with it.
>>>
>>> Using sched_groups we do already have single cpu groups that we can
>>> attach per-cpu data to, but we are missing a top level group spanning
>>> the entire system for system wide energy data. So from that point of
>>> view groups and domains are equally bad.
>>
>> Oh urgh, good point that. Cursed if you do, cursed if you don't. Bugger.
>
> Yeah :( I don't really care which one we choose. Adding another top
> level domain with one big group spanning all cpus, but with all SD flags
> disabled seems less intrusive than adding a level at the bottom.
>
> Better ideas are very welcome.
>

I had a stab at integrating such a top level (SYS) domain w/ all known SD
flags disabled. This SYS sd exposes itself w/ all counters set to 0 in
/proc/schedstat.

There are still some kludges in the patch below:

- The need for a new topology SD flag to tell sd_init() that we want to
reset the default sd configuration.
- Don't break in build_sched_domains() at the first sd spanning cpu_map
- Don't decay newidle max times in rebalance_domains() by bailing early
on SYS sd.
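
To make the second kludge concrete, here is a small userspace sketch (not kernel code) of why the early break has to go. Everything in it is a stand-in I made up for illustration: toy_cpumask, struct toy_level and toy_build_levels() are hypothetical, and a span is just a bitmask with one bit per cpu. With the break in place, the first level that already spans cpu_map (DIE on dual cluster, MC on single cluster) ends the loop, so a SYS level stacked on top would never get built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cpumask: one bit per cpu. */
typedef unsigned int toy_cpumask;

/* Hypothetical stand-in for one topology level as seen from one cpu. */
struct toy_level {
	const char *name;
	toy_cpumask span;	/* cpus covered by the domain at this level */
};

/*
 * Count how many levels get a domain for one cpu.
 * break_on_full_span == true models the early break that the patch removes,
 * false models the loop after the patch.
 */
static size_t toy_build_levels(const struct toy_level *tl, size_t nr_levels,
			       toy_cpumask cpu_map, bool break_on_full_span)
{
	size_t built = 0;

	assert(tl != NULL);

	for (size_t i = 0; i < nr_levels; i++) {
		built++;	/* a domain is set up for this level */
		if (break_on_full_span && tl[i].span == cpu_map)
			break;	/* nothing above this level gets built */
	}

	return built;
}
```

For a dual cluster layout (MC = 0x3, DIE = 0xf, SYS = 0xf, cpu_map = 0xf) the sketch builds two levels with the break and three without it, which is the MC-DIE-SYS stack mentioned below.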

It survived booting on single (MC-SYS) and dual cluster ARM (MC-DIE-SYS)
systems.
Would something like this be acceptable?

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f984b4e58865..8fbc9976f5d1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -904,6 +904,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
 #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
 #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
+#define SD_SHARE_ENERGY 0x0040 /* System-wide energy data */
 #define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu power */
 #define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4f52c2e7484e..d058dc1e639f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5529,7 +5529,7 @@ static int sd_degenerate(struct sched_domain *sd)
 	}

 	/* Following flags don't use groups */
-	if (sd->flags & (SD_WAKE_AFFINE))
+	if (sd->flags & (SD_WAKE_AFFINE | SD_SHARE_ENERGY))
 		return 0;

 	return 1;
@@ -6215,8 +6215,9 @@ static int sched_domains_curr_level;
  * SD_SHARE_POWERDOMAIN - describes shared power domain
  * SD_SHARE_CAP_STATES - describes shared capacity states
  *
- * Odd one out:
+ * Odd two out:
  * SD_ASYM_PACKING - describes SMT quirks
+ * SD_SHARE_ENERGY - describes EAS quirks
  */
 #define TOPOLOGY_SD_FLAGS \
 	(SD_SHARE_CPUCAPACITY | \
@@ -6224,7 +6225,8 @@ static int sched_domains_curr_level;
 	 SD_NUMA | \
 	 SD_ASYM_PACKING | \
 	 SD_SHARE_POWERDOMAIN | \
-	 SD_SHARE_CAP_STATES)
+	 SD_SHARE_CAP_STATES | \
+	 SD_SHARE_ENERGY)

 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
@@ -6298,6 +6300,14 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;

+	} else if (sd->flags & SD_SHARE_ENERGY) {
+		/* Reset the default configuration completely */
+		memset(sd, 0, sizeof(*sd));
+
+		sd->flags = 1*SD_SHARE_ENERGY;
+#ifdef CONFIG_SCHED_DEBUG
+		sd->name = tl->name;
+#endif
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
 		sd->cache_nice_tries = 2;
@@ -6826,8 +6836,6 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 			*per_cpu_ptr(d.sd, i) = sd;
 			if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
 				sd->flags |= SD_OVERLAP;
-			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
-				break;
 		}
 	}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfe65aec3237..8d4cc72f4778 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8073,6 +8073,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+
+		if (sd->flags & SD_SHARE_ENERGY)
+			continue;
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.

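For reference, the effect of the rebalance_domains() hunk can be modelled in userspace roughly like this. It is only a sketch under my own assumptions: struct toy_sd, TOY_MAX_DEPTH and toy_rebalance_walk() are hypothetical names, only the two flag values are taken from the sched.h hunk above, and the real loop does much more than count visits. The point is just that the walk up the parent chain still passes the SYS sd but does no per-domain work on it.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as in the sched.h hunk; the TOY_ prefix marks them as copies. */
#define TOY_SD_WAKE_AFFINE	0x0020
#define TOY_SD_SHARE_ENERGY	0x0040

/* Hypothetical bound on hierarchy depth, only used as a sanity check. */
#define TOY_MAX_DEPTH		16

/* Hypothetical stand-in for struct sched_domain. */
struct toy_sd {
	const char *name;
	int flags;
	struct toy_sd *parent;
	unsigned int visits;	/* how often balancing work touched this sd */
};

/*
 * Walk from the lowest domain to the top, like for_each_domain(), and
 * skip any domain carrying the energy flag before doing per-domain work.
 * Returns the number of domains that were actually processed.
 */
static unsigned int toy_rebalance_walk(struct toy_sd *sd)
{
	unsigned int processed = 0;
	unsigned int depth = 0;

	for (; sd; sd = sd->parent) {
		assert(depth++ < TOY_MAX_DEPTH);	/* guard against a cyclic chain */

		if (sd->flags & TOY_SD_SHARE_ENERGY)
			continue;	/* SYS sd: no decay, no balancing */

		sd->visits++;
		processed++;
	}

	return processed;
}
```

On an MC-DIE-SYS chain this processes MC and DIE and leaves the SYS sd untouched, which matches the all-zero counters in /proc/schedstat mentioned above.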