Re: [PATCH 2/6] sched: rework of sched_domain topology definition

From: Vincent Guittot
Date: Thu Mar 06 2014 - 03:33:06 EST


On 6 March 2014 01:09, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> On 05/03/14 07:18, Vincent Guittot wrote:
>>
>> We replace the old way to configure the scheduler topology with a new
>> method
>> which enables a platform to declare additionnal level (if needed).
>>
>> We still have a default topology table definition that can be used by
>> platform
>> that don't want more level than the SMT, MC, CPU and NUMA ones. This table
>> can
>> be overwritten by an arch which wants to add new level where a load
>> balance
>> make sense like BOOK or powergating level.
>>
>> For each level, we need a function pointer that returns cpumask for each
>> cpu,
>> the flags configuration and a name. Each level must be a subset on
>
>
> Maybe it's worth mentioning here that those flags are from the set of the sd
> topology flags to distinguish them from the set of sd behavioural flags.
> Latter can't be set via this interface.

yes, i will add the list of flags that can be set with the table

>
>
>> the next one. The build sequence of the sched_domain will take care of
>> removing useless levels like those with 1 CPU and those with the same CPU
>> span
>> and relevant information for load balancing than its child .
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>> ---
>> arch/ia64/include/asm/topology.h | 24 ----
>> arch/s390/include/asm/topology.h | 2 -
>> arch/tile/include/asm/topology.h | 33 ------
>> include/linux/sched.h | 29 +++++
>> include/linux/topology.h | 128 +++------------------
>> kernel/sched/core.c | 227
>> +++++++++++++++++++-------------------
>> 6 files changed, 156 insertions(+), 287 deletions(-)
>>

[snip]

>> -
>> -#define for_each_sd_topology(tl) \
>> - for (tl = sched_domain_topology; tl->init; tl++)
>
>
> Why is sched_domains_curr_level now outside #ifdef CONFIG_NUMA?

it should not as well as its use in sd_init

>
>
>> +static int sched_domains_curr_level;
>>
>> #ifdef CONFIG_NUMA
>> -
>> static int sched_domains_numa_levels;
>> static int *sched_domains_numa_distance;
>> static struct cpumask ***sched_domains_numa_masks;
>> -static int sched_domains_curr_level;
>> -
>> -static inline int sd_local_flags(int level)
>> -{
>> - if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
>> - return 0;
>> +#endif
>>
>> - return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
>> -}
>> +/*
>> + * SD_flags allowed in topology descriptions.
>> + *
>> + * SD_SHARE_CPUPOWER - describes SMT topologies
>> + * SD_SHARE_PKG_RESOURCES - describes shared caches
>> + * SD_NUMA - describes NUMA topologies
>> + *
>> + * Odd one out:
>> + * SD_ASYM_PACKING - describes SMT quirks
>> + */
>> +#define TOPOLOGY_SD_FLAGS \
>> + (SD_SHARE_CPUPOWER | \
>> + SD_SHARE_PKG_RESOURCES | \
>> + SD_NUMA | \
>> + SD_ASYM_PACKING)
>>
>> static struct sched_domain *
>> -sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
>> +sd_init(struct sched_domain_topology_level *tl, int cpu)
>> {
>> struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
>> - int level = tl->numa_level;
>> - int sd_weight = cpumask_weight(
>> -
>> sched_domains_numa_masks[level][cpu_to_node(cpu)]);
>> + int sd_weight;
>> +
>
>
> Next line could be guared by #ifdef CONFIG_NUMA. We still use #ifdef
> CONFIG_NUMA later in sd_init() though.
>
>> + /*
>> + * Ugly hack to pass state to sd_numa_mask()...
>> + */
>> + sched_domains_curr_level = tl->numa_level;
>> +
>> + sd_weight = cpumask_weight(tl->mask(cpu));
>> +
>> + if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
>> + "wrong sd_flags in topology description\n"))
>> + tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
>>
>> *sd = (struct sched_domain){
>> .min_interval = sd_weight,
>> .max_interval = 2*sd_weight,
>> .busy_factor = 32,
>> .imbalance_pct = 125,
>> - .cache_nice_tries = 2,
>> - .busy_idx = 3,
>> - .idle_idx = 2,
>> +
>> + .cache_nice_tries = 0,
>> + .busy_idx = 0,
>> + .idle_idx = 0,
>> .newidle_idx = 0,
>> .wake_idx = 0,
>> .forkexec_idx = 0,
>
>
> Why we want to explicitly set those indexes to 0 here? IMHO, the memory for
> *sd is zeroed out before. This is true for all data members which are set to
> 0 later in this function including the | 0*SD_FOO . IMHO, would make the
> code more readable.

I would say that it makes the configuration more readable and
modifiable because you have the list of possible flag to set

>
>
>>
>> .flags = 1*SD_LOAD_BALANCE
>> | 1*SD_BALANCE_NEWIDLE
>> - | 0*SD_BALANCE_EXEC
>> - | 0*SD_BALANCE_FORK
>> + | 1*SD_BALANCE_EXEC
>> + | 1*SD_BALANCE_FORK
>> | 0*SD_BALANCE_WAKE
>> - | 0*SD_WAKE_AFFINE
>> + | 1*SD_WAKE_AFFINE
>> | 0*SD_SHARE_CPUPOWER
>> | 0*SD_SHARE_PKG_RESOURCES
>> - | 1*SD_SERIALIZE
>> + | 0*SD_SERIALIZE
>> | 0*SD_PREFER_SIBLING
>> - | 1*SD_NUMA
>> - | sd_local_flags(level)
>> + | 0*SD_NUMA
>> + | tl->sd_flags
>> ,
>> +
>> .last_balance = jiffies,
>> .balance_interval = sd_weight,
>> + .smt_gain = 0,
>> + .max_newidle_lb_cost = 0,
>> + .next_decay_max_lb_cost = jiffies,
>> +#ifdef CONFIG_SCHED_DEBUG
>> + .name = tl->name,
>> +#endif
>> };
>> - SD_INIT_NAME(sd, NUMA);
>> - sd->private = &tl->data;
>>

[snip]

>>
>> +/*
>> + * Topology list, bottom-up.
>> + */
>> +static struct sched_domain_topology_level default_topology[] = {
>> +#ifdef CONFIG_SCHED_SMT
>> + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>> SD_INIT_NAME(SMT) },
>> +#endif
>> +#ifdef CONFIG_SCHED_MC
>> + { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
>> +#endif
>> +#ifdef CONFIG_SCHED_BOOK
>> + { cpu_book_mask, SD_INIT_NAME(BOOK) },
>> +#endif
>
>
> Never got the new name DIE for CPU? Might confuse people when they use
> /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().

In fact, CPU is also confusing because it's used for different things.
But if it makes things even more confusing, i can come back to CPU

>
>
>> + { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>> + { NULL, },
>> +};
>> +
>> +struct sched_domain_topology_level *sched_domain_topology =
>> default_topology;
>> +
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/