Re: [PATCH] ia64 cpuset + build_sched_domains() mangles structures

From: Dinakar Guniguntala
Date: Fri Sep 02 2005 - 09:35:31 EST


Andrew,

Please include the patch below in -mm. I had reported a problem
with this patch earlier on 2.6.13-rc6, but I have not been able to
reproduce that problem on newer kernels (2.6.13 and 2.6.13-mm1).

I have tested this extensively on a Power5 box, and I believe
that John Hawkes has tested this on ia64 as well.

The patch is here:

http://marc.theaimsgroup.com/?l=linux-ia64&m=112474434128996&w=2


Regards,

Dinakar



On Mon, Aug 22, 2005 at 06:07:19PM +0200, Ingo Molnar wrote:
>
> * Dinakar Guniguntala <dino@xxxxxxxxxx> wrote:
>
> > On Mon, Aug 22, 2005 at 09:08:34AM +0200, Ingo Molnar wrote:
> > >
> > > in terms of 2.6.14, the replacement patch below also does what i always
> > > wanted to do: to merge the ia64-specific build_sched_domains() code back
> > > into kernel/sched.c. I've done this by taking your improved dynamic
> > > build-domains code and putting it into kernel/sched.c.
> > >
> >
> > Ingo, with one change to your patch, the exclusive cpuset
> > functionality seems to work fine on a NUMA ppc64 box.
> > I am still running some of my dynamic sched domain tests; so far
> > it seems to be holding up OK.
>
> great! Andrew, i'd suggest we try the merged patch attached below in
> -mm.
>
> > Any idea why the ia64 stuff was forked in the first place?
>
> most of the NUMA domain-trees stuff happened in the ia64 space so there
> was a natural desire to keep it more hackable there. But now i think
> it's getting counterproductive.
>
> Ingo
>
> -----
> I've already sent this to the maintainers, and this is now being sent to a
> larger community audience. I have fixed a problem with the ia64 version of
> build_sched_domains(), but a similar fix still needs to be made to the
> generic build_sched_domains() in kernel/sched.c.
>
> The "dynamic sched domains" functionality has recently been merged into
> 2.6.13-rcN that sees the dynamic declaration of a cpu-exclusive (a.k.a.
> "isolated") cpuset and rebuilds the CPU Scheduler sched domains and sched
> groups to separate away the CPUs in this cpu-exclusive cpuset from the
> remainder of the non-isolated CPUs. This allows the non-isolated CPUs to
> completely ignore the isolated CPUs when doing load-balancing.
>
> Unfortunately, build_sched_domains() expects that a sched domain will
> include all the CPUs of each node in the domain, i.e., that no node will
> have CPUs in both an isolated cpuset and a non-isolated cpuset. Declaring
> a cpuset that violates this assumption produces flawed data structures
> and will oops the kernel.
>
> To trigger the problem (on a NUMA system with more than one CPU per node):
> cd /dev/cpuset
> mkdir newcpuset
> cd newcpuset
> echo 0 >cpus
> echo 0 >mems
> echo 1 >cpu_exclusive
>
> I have fixed this shortcoming for ia64 NUMA (with multiple CPUs per node).
> A similar shortcoming exists in the generic build_sched_domains() (in
> kernel/sched.c) for NUMA, and that needs to be fixed also. The fix involves
> dynamically allocating sched_group_nodes[] and sched_group_allnodes[] for
> each invocation of build_sched_domains(), rather than using global arrays
> for these structures. Care must be taken to remember kmalloc() addresses
> so that arch_destroy_sched_domains() can properly kfree() the new dynamic
> structures.
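>
> In outline, the per-node allocation and teardown pattern looks roughly
> like this (a simplified sketch with illustrative helper names, not the
> literal patch hunks; the real code below also chains multiple groups
> per node and handles the allnodes level):
>
>     /* one dynamically allocated, circular list of groups per node */
>     static struct sched_group *sched_group_nodes[MAX_NUMNODES];
>
>     static void alloc_node_groups(const cpumask_t *cpu_map, int node)
>     {
>             cpumask_t nodemask = node_to_cpumask(node);
>             struct sched_group *sg;
>
>             cpus_and(nodemask, nodemask, *cpu_map);
>             if (cpus_empty(nodemask))
>                     return;
>
>             sg = kmalloc(sizeof(*sg), GFP_KERNEL);
>             if (!sg)
>                     return;
>             sg->cpu_power = 0;
>             sg->cpumask = nodemask;
>             sg->next = sg;                  /* close the circular list */
>             sched_group_nodes[node] = sg;   /* remembered for teardown */
>     }
>
>     static void free_node_groups(int node)
>     {
>             struct sched_group *sg = sched_group_nodes[node], *oldsg;
>
>             if (!sg)
>                     return;
>             do {    /* walk the circular list, freeing each group once */
>                     oldsg = sg;
>                     sg = sg->next;
>                     kfree(oldsg);
>             } while (sg != sched_group_nodes[node]);
>             sched_group_nodes[node] = NULL;
>     }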
>
> This is a patch against 2.6.13-rc6.
>
> Signed-off-by: John Hawkes <hawkes@xxxxxxx>
>
> reworked the patch to also move the ia64 domain setup code to the generic
> code.
>
> Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
>
> ppc64 fix
>
> From: Dinakar Guniguntala <dino@xxxxxxxxxx>
>
> arch/ia64/kernel/domain.c | 400 -------------------------------------------
> arch/ia64/kernel/Makefile | 2
> include/asm-ia64/processor.h | 3
> include/asm-ia64/topology.h | 22 --
> include/linux/sched.h | 9
> include/linux/topology.h | 22 ++
> kernel/sched.c | 290 +++++++++++++++++++++++++------
> 7 files changed, 259 insertions(+), 489 deletions(-)
>
> Index: linux-sched-curr/arch/ia64/kernel/Makefile
> ===================================================================
> --- linux-sched-curr.orig/arch/ia64/kernel/Makefile
> +++ linux-sched-curr/arch/ia64/kernel/Makefile
> @@ -16,7 +16,7 @@ obj-$(CONFIG_IA64_HP_ZX1_SWIOTLB) += acp
> obj-$(CONFIG_IA64_PALINFO) += palinfo.o
> obj-$(CONFIG_IOSAPIC) += iosapic.o
> obj-$(CONFIG_MODULES) += module.o
> -obj-$(CONFIG_SMP) += smp.o smpboot.o domain.o
> +obj-$(CONFIG_SMP) += smp.o smpboot.o
> obj-$(CONFIG_NUMA) += numa.o
> obj-$(CONFIG_PERFMON) += perfmon_default_smpl.o
> obj-$(CONFIG_IA64_CYCLONE) += cyclone.o
> Index: linux-sched-curr/arch/ia64/kernel/domain.c
> ===================================================================
> --- linux-sched-curr.orig/arch/ia64/kernel/domain.c
> +++ /dev/null
> @@ -1,400 +0,0 @@
> -/*
> - * arch/ia64/kernel/domain.c
> - * Architecture specific sched-domains builder.
> - *
> - * Copyright (C) 2004 Jesse Barnes
> - * Copyright (C) 2004 Silicon Graphics, Inc.
> - */
> -
> -#include <linux/sched.h>
> -#include <linux/percpu.h>
> -#include <linux/slab.h>
> -#include <linux/cpumask.h>
> -#include <linux/init.h>
> -#include <linux/topology.h>
> -#include <linux/nodemask.h>
> -
> -#define SD_NODES_PER_DOMAIN 16
> -
> -#ifdef CONFIG_NUMA
> -/**
> - * find_next_best_node - find the next node to include in a sched_domain
> - * @node: node whose sched_domain we're building
> - * @used_nodes: nodes already in the sched_domain
> - *
> - * Find the next node to include in a given scheduling domain. Simply
> - * finds the closest node not already in the @used_nodes map.
> - *
> - * Should use nodemask_t.
> - */
> -static int find_next_best_node(int node, unsigned long *used_nodes)
> -{
> - int i, n, val, min_val, best_node = 0;
> -
> - min_val = INT_MAX;
> -
> - for (i = 0; i < MAX_NUMNODES; i++) {
> - /* Start at @node */
> - n = (node + i) % MAX_NUMNODES;
> -
> - if (!nr_cpus_node(n))
> - continue;
> -
> - /* Skip already used nodes */
> - if (test_bit(n, used_nodes))
> - continue;
> -
> - /* Simple min distance search */
> - val = node_distance(node, n);
> -
> - if (val < min_val) {
> - min_val = val;
> - best_node = n;
> - }
> - }
> -
> - set_bit(best_node, used_nodes);
> - return best_node;
> -}
> -
> -/**
> - * sched_domain_node_span - get a cpumask for a node's sched_domain
> - * @node: node whose cpumask we're constructing
> - * @size: number of nodes to include in this span
> - *
> - * Given a node, construct a good cpumask for its sched_domain to span. It
> - * should be one that prevents unnecessary balancing, but also spreads tasks
> - * out optimally.
> - */
> -static cpumask_t sched_domain_node_span(int node)
> -{
> - int i;
> - cpumask_t span, nodemask;
> - DECLARE_BITMAP(used_nodes, MAX_NUMNODES);
> -
> - cpus_clear(span);
> - bitmap_zero(used_nodes, MAX_NUMNODES);
> -
> - nodemask = node_to_cpumask(node);
> - cpus_or(span, span, nodemask);
> - set_bit(node, used_nodes);
> -
> - for (i = 1; i < SD_NODES_PER_DOMAIN; i++) {
> - int next_node = find_next_best_node(node, used_nodes);
> - nodemask = node_to_cpumask(next_node);
> - cpus_or(span, span, nodemask);
> - }
> -
> - return span;
> -}
> -#endif
> -
> -/*
> - * At the moment, CONFIG_SCHED_SMT is never defined, but leave it in so we
> - * can switch it on easily if needed.
> - */
> -#ifdef CONFIG_SCHED_SMT
> -static DEFINE_PER_CPU(struct sched_domain, cpu_domains);
> -static struct sched_group sched_group_cpus[NR_CPUS];
> -static int cpu_to_cpu_group(int cpu)
> -{
> - return cpu;
> -}
> -#endif
> -
> -static DEFINE_PER_CPU(struct sched_domain, phys_domains);
> -static struct sched_group sched_group_phys[NR_CPUS];
> -static int cpu_to_phys_group(int cpu)
> -{
> -#ifdef CONFIG_SCHED_SMT
> - return first_cpu(cpu_sibling_map[cpu]);
> -#else
> - return cpu;
> -#endif
> -}
> -
> -#ifdef CONFIG_NUMA
> -/*
> - * The init_sched_build_groups can't handle what we want to do with node
> - * groups, so roll our own. Now each node has its own list of groups which
> - * gets dynamically allocated.
> - */
> -static DEFINE_PER_CPU(struct sched_domain, node_domains);
> -static struct sched_group *sched_group_nodes[MAX_NUMNODES];
> -
> -static DEFINE_PER_CPU(struct sched_domain, allnodes_domains);
> -static struct sched_group sched_group_allnodes[MAX_NUMNODES];
> -
> -static int cpu_to_allnodes_group(int cpu)
> -{
> - return cpu_to_node(cpu);
> -}
> -#endif
> -
> -/*
> - * Build sched domains for a given set of cpus and attach the sched domains
> - * to the individual cpus
> - */
> -void build_sched_domains(const cpumask_t *cpu_map)
> -{
> - int i;
> -
> - /*
> - * Set up domains for cpus specified by the cpu_map.
> - */
> - for_each_cpu_mask(i, *cpu_map) {
> - int group;
> - struct sched_domain *sd = NULL, *p;
> - cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
> -
> - cpus_and(nodemask, nodemask, *cpu_map);
> -
> -#ifdef CONFIG_NUMA
> - if (num_online_cpus()
> - > SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
> - sd = &per_cpu(allnodes_domains, i);
> - *sd = SD_ALLNODES_INIT;
> - sd->span = *cpu_map;
> - group = cpu_to_allnodes_group(i);
> - sd->groups = &sched_group_allnodes[group];
> - p = sd;
> - } else
> - p = NULL;
> -
> - sd = &per_cpu(node_domains, i);
> - *sd = SD_NODE_INIT;
> - sd->span = sched_domain_node_span(cpu_to_node(i));
> - sd->parent = p;
> - cpus_and(sd->span, sd->span, *cpu_map);
> -#endif
> -
> - p = sd;
> - sd = &per_cpu(phys_domains, i);
> - group = cpu_to_phys_group(i);
> - *sd = SD_CPU_INIT;
> - sd->span = nodemask;
> - sd->parent = p;
> - sd->groups = &sched_group_phys[group];
> -
> -#ifdef CONFIG_SCHED_SMT
> - p = sd;
> - sd = &per_cpu(cpu_domains, i);
> - group = cpu_to_cpu_group(i);
> - *sd = SD_SIBLING_INIT;
> - sd->span = cpu_sibling_map[i];
> - cpus_and(sd->span, sd->span, *cpu_map);
> - sd->parent = p;
> - sd->groups = &sched_group_cpus[group];
> -#endif
> - }
> -
> -#ifdef CONFIG_SCHED_SMT
> - /* Set up CPU (sibling) groups */
> - for_each_cpu_mask(i, *cpu_map) {
> - cpumask_t this_sibling_map = cpu_sibling_map[i];
> - cpus_and(this_sibling_map, this_sibling_map, *cpu_map);
> - if (i != first_cpu(this_sibling_map))
> - continue;
> -
> - init_sched_build_groups(sched_group_cpus, this_sibling_map,
> - &cpu_to_cpu_group);
> - }
> -#endif
> -
> - /* Set up physical groups */
> - for (i = 0; i < MAX_NUMNODES; i++) {
> - cpumask_t nodemask = node_to_cpumask(i);
> -
> - cpus_and(nodemask, nodemask, *cpu_map);
> - if (cpus_empty(nodemask))
> - continue;
> -
> - init_sched_build_groups(sched_group_phys, nodemask,
> - &cpu_to_phys_group);
> - }
> -
> -#ifdef CONFIG_NUMA
> - init_sched_build_groups(sched_group_allnodes, *cpu_map,
> - &cpu_to_allnodes_group);
> -
> - for (i = 0; i < MAX_NUMNODES; i++) {
> - /* Set up node groups */
> - struct sched_group *sg, *prev;
> - cpumask_t nodemask = node_to_cpumask(i);
> - cpumask_t domainspan;
> - cpumask_t covered = CPU_MASK_NONE;
> - int j;
> -
> - cpus_and(nodemask, nodemask, *cpu_map);
> - if (cpus_empty(nodemask))
> - continue;
> -
> - domainspan = sched_domain_node_span(i);
> - cpus_and(domainspan, domainspan, *cpu_map);
> -
> - sg = kmalloc(sizeof(struct sched_group), GFP_KERNEL);
> - sched_group_nodes[i] = sg;
> - for_each_cpu_mask(j, nodemask) {
> - struct sched_domain *sd;
> - sd = &per_cpu(node_domains, j);
> - sd->groups = sg;
> - if (sd->groups == NULL) {
> - /* Turn off balancing if we have no groups */
> - sd->flags = 0;
> - }
> - }
> - if (!sg) {
> - printk(KERN_WARNING
> - "Can not alloc domain group for node %d\n", i);
> - continue;
> - }
> - sg->cpu_power = 0;
> - sg->cpumask = nodemask;
> - cpus_or(covered, covered, nodemask);
> - prev = sg;
> -
> - for (j = 0; j < MAX_NUMNODES; j++) {
> - cpumask_t tmp, notcovered;
> - int n = (i + j) % MAX_NUMNODES;
> -
> - cpus_complement(notcovered, covered);
> - cpus_and(tmp, notcovered, *cpu_map);
> - cpus_and(tmp, tmp, domainspan);
> - if (cpus_empty(tmp))
> - break;
> -
> - nodemask = node_to_cpumask(n);
> - cpus_and(tmp, tmp, nodemask);
> - if (cpus_empty(tmp))
> - continue;
> -
> - sg = kmalloc(sizeof(struct sched_group), GFP_KERNEL);
> - if (!sg) {
> - printk(KERN_WARNING
> - "Can not alloc domain group for node %d\n", j);
> - break;
> - }
> - sg->cpu_power = 0;
> - sg->cpumask = tmp;
> - cpus_or(covered, covered, tmp);
> - prev->next = sg;
> - prev = sg;
> - }
> - prev->next = sched_group_nodes[i];
> - }
> -#endif
> -
> - /* Calculate CPU power for physical packages and nodes */
> - for_each_cpu_mask(i, *cpu_map) {
> - int power;
> - struct sched_domain *sd;
> -#ifdef CONFIG_SCHED_SMT
> - sd = &per_cpu(cpu_domains, i);
> - power = SCHED_LOAD_SCALE;
> - sd->groups->cpu_power = power;
> -#endif
> -
> - sd = &per_cpu(phys_domains, i);
> - power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
> - (cpus_weight(sd->groups->cpumask)-1) / 10;
> - sd->groups->cpu_power = power;
> -
> -#ifdef CONFIG_NUMA
> - sd = &per_cpu(allnodes_domains, i);
> - if (sd->groups) {
> - power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
> - (cpus_weight(sd->groups->cpumask)-1) / 10;
> - sd->groups->cpu_power = power;
> - }
> -#endif
> - }
> -
> -#ifdef CONFIG_NUMA
> - for (i = 0; i < MAX_NUMNODES; i++) {
> - struct sched_group *sg = sched_group_nodes[i];
> - int j;
> -
> - if (sg == NULL)
> - continue;
> -next_sg:
> - for_each_cpu_mask(j, sg->cpumask) {
> - struct sched_domain *sd;
> - int power;
> -
> - sd = &per_cpu(phys_domains, j);
> - if (j != first_cpu(sd->groups->cpumask)) {
> - /*
> - * Only add "power" once for each
> - * physical package.
> - */
> - continue;
> - }
> - power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
> - (cpus_weight(sd->groups->cpumask)-1) / 10;
> -
> - sg->cpu_power += power;
> - }
> - sg = sg->next;
> - if (sg != sched_group_nodes[i])
> - goto next_sg;
> - }
> -#endif
> -
> - /* Attach the domains */
> - for_each_online_cpu(i) {
> - struct sched_domain *sd;
> -#ifdef CONFIG_SCHED_SMT
> - sd = &per_cpu(cpu_domains, i);
> -#else
> - sd = &per_cpu(phys_domains, i);
> -#endif
> - cpu_attach_domain(sd, i);
> - }
> - /*
> - * Tune cache-hot values:
> - */
> - calibrate_migration_costs();
> -}
> -/*
> - * Set up scheduler domains and groups. Callers must hold the hotplug lock.
> - */
> -void arch_init_sched_domains(const cpumask_t *cpu_map)
> -{
> - cpumask_t cpu_default_map;
> -
> - /*
> - * Setup mask for cpus without special case scheduling requirements.
> - * For now this just excludes isolated cpus, but could be used to
> - * exclude other special cases in the future.
> - */
> - cpus_andnot(cpu_default_map, *cpu_map, cpu_isolated_map);
> -
> - build_sched_domains(&cpu_default_map);
> -}
> -
> -void arch_destroy_sched_domains(const cpumask_t *cpu_map)
> -{
> -#ifdef CONFIG_NUMA
> - int i;
> - for (i = 0; i < MAX_NUMNODES; i++) {
> - cpumask_t nodemask = node_to_cpumask(i);
> - struct sched_group *oldsg, *sg = sched_group_nodes[i];
> -
> - cpus_and(nodemask, nodemask, *cpu_map);
> - if (cpus_empty(nodemask))
> - continue;
> -
> - if (sg == NULL)
> - continue;
> - sg = sg->next;
> -next_sg:
> - oldsg = sg;
> - sg = sg->next;
> - kfree(oldsg);
> - if (oldsg != sched_group_nodes[i])
> - goto next_sg;
> - sched_group_nodes[i] = NULL;
> - }
> -#endif
> -}
> -
> Index: linux-sched-curr/include/asm-ia64/processor.h
> ===================================================================
> --- linux-sched-curr.orig/include/asm-ia64/processor.h
> +++ linux-sched-curr/include/asm-ia64/processor.h
> @@ -20,9 +20,6 @@
> #include <asm/ptrace.h>
> #include <asm/ustack.h>
>
> -/* Our arch specific arch_init_sched_domain is in arch/ia64/kernel/domain.c */
> -#define ARCH_HAS_SCHED_DOMAIN
> -
> #define IA64_NUM_DBG_REGS 8
> /*
> * Limits for PMC and PMD are set to less than maximum architected values
> Index: linux-sched-curr/include/asm-ia64/topology.h
> ===================================================================
> --- linux-sched-curr.orig/include/asm-ia64/topology.h
> +++ linux-sched-curr/include/asm-ia64/topology.h
> @@ -96,28 +96,6 @@ void build_cpu_to_node_map(void);
> .nr_balance_failed = 0, \
> }
>
> -/* sched_domains SD_ALLNODES_INIT for IA64 NUMA machines */
> -#define SD_ALLNODES_INIT (struct sched_domain) { \
> - .span = CPU_MASK_NONE, \
> - .parent = NULL, \
> - .groups = NULL, \
> - .min_interval = 64, \
> - .max_interval = 64*num_online_cpus(), \
> - .busy_factor = 128, \
> - .imbalance_pct = 133, \
> - .cache_nice_tries = 1, \
> - .busy_idx = 3, \
> - .idle_idx = 3, \
> - .newidle_idx = 0, /* unused */ \
> - .wake_idx = 0, /* unused */ \
> - .forkexec_idx = 0, /* unused */ \
> - .per_cpu_gain = 100, \
> - .flags = SD_LOAD_BALANCE, \
> - .last_balance = jiffies, \
> - .balance_interval = 64, \
> - .nr_balance_failed = 0, \
> -}
> -
> #endif /* CONFIG_NUMA */
>
> #include <asm-generic/topology.h>
> Index: linux-sched-curr/include/linux/sched.h
> ===================================================================
> --- linux-sched-curr.orig/include/linux/sched.h
> +++ linux-sched-curr/include/linux/sched.h
> @@ -546,15 +546,6 @@ struct sched_domain {
>
> extern void partition_sched_domains(cpumask_t *partition1,
> cpumask_t *partition2);
> -#ifdef ARCH_HAS_SCHED_DOMAIN
> -/* Useful helpers that arch setup code may use. Defined in kernel/sched.c */
> -extern cpumask_t cpu_isolated_map;
> -extern void init_sched_build_groups(struct sched_group groups[],
> - cpumask_t span, int (*group_fn)(int cpu));
> -extern void cpu_attach_domain(struct sched_domain *sd, int cpu);
> -
> -#endif /* ARCH_HAS_SCHED_DOMAIN */
> -
> /*
> * Maximum cache size the migration-costs auto-tuning code will
> * search from:
> Index: linux-sched-curr/include/linux/topology.h
> ===================================================================
> --- linux-sched-curr.orig/include/linux/topology.h
> +++ linux-sched-curr/include/linux/topology.h
> @@ -133,6 +133,28 @@
> }
> #endif
>
> +/* sched_domains SD_ALLNODES_INIT for NUMA machines */
> +#define SD_ALLNODES_INIT (struct sched_domain) { \
> + .span = CPU_MASK_NONE, \
> + .parent = NULL, \
> + .groups = NULL, \
> + .min_interval = 64, \
> + .max_interval = 64*num_online_cpus(), \
> + .busy_factor = 128, \
> + .imbalance_pct = 133, \
> + .cache_nice_tries = 1, \
> + .busy_idx = 3, \
> + .idle_idx = 3, \
> + .newidle_idx = 0, /* unused */ \
> + .wake_idx = 0, /* unused */ \
> + .forkexec_idx = 0, /* unused */ \
> + .per_cpu_gain = 100, \
> + .flags = SD_LOAD_BALANCE, \
> + .last_balance = jiffies, \
> + .balance_interval = 64, \
> + .nr_balance_failed = 0, \
> +}
> +
> #ifdef CONFIG_NUMA
> #ifndef SD_NODE_INIT
> #error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
> Index: linux-sched-curr/kernel/sched.c
> ===================================================================
> --- linux-sched-curr.orig/kernel/sched.c
> +++ linux-sched-curr/kernel/sched.c
> @@ -4947,7 +4947,7 @@ static int sd_parent_degenerate(struct s
> * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
> * hold the hotplug lock.
> */
> -void cpu_attach_domain(struct sched_domain *sd, int cpu)
> +static void cpu_attach_domain(struct sched_domain *sd, int cpu)
> {
> runqueue_t *rq = cpu_rq(cpu);
> struct sched_domain *tmp;
> @@ -4970,7 +4970,7 @@ void cpu_attach_domain(struct sched_doma
> }
>
> /* cpus with isolated domains */
> -cpumask_t __devinitdata cpu_isolated_map = CPU_MASK_NONE;
> +static cpumask_t __devinitdata cpu_isolated_map = CPU_MASK_NONE;
>
> /* Setup the mask of cpus configured for isolated domains */
> static int __init isolated_cpu_setup(char *str)
> @@ -4998,8 +4998,8 @@ __setup ("isolcpus=", isolated_cpu_setup
> * covered by the given span, and will set each group's ->cpumask correctly,
> * and ->cpu_power to 0.
> */
> -void init_sched_build_groups(struct sched_group groups[],
> - cpumask_t span, int (*group_fn)(int cpu))
> +static void init_sched_build_groups(struct sched_group groups[], cpumask_t span,
> + int (*group_fn)(int cpu))
> {
> struct sched_group *first = NULL, *last = NULL;
> cpumask_t covered = CPU_MASK_NONE;
> @@ -5513,12 +5513,85 @@ void __devinit calibrate_migration_costs
> local_irq_restore(flags);
> }
>
> +#define SD_NODES_PER_DOMAIN 16
>
> -#ifdef ARCH_HAS_SCHED_DOMAIN
> -extern void build_sched_domains(const cpumask_t *cpu_map);
> -extern void arch_init_sched_domains(const cpumask_t *cpu_map);
> -extern void arch_destroy_sched_domains(const cpumask_t *cpu_map);
> -#else
> +#ifdef CONFIG_NUMA
> +/**
> + * find_next_best_node - find the next node to include in a sched_domain
> + * @node: node whose sched_domain we're building
> + * @used_nodes: nodes already in the sched_domain
> + *
> + * Find the next node to include in a given scheduling domain. Simply
> + * finds the closest node not already in the @used_nodes map.
> + *
> + * Should use nodemask_t.
> + */
> +static int find_next_best_node(int node, unsigned long *used_nodes)
> +{
> + int i, n, val, min_val, best_node = 0;
> +
> + min_val = INT_MAX;
> +
> + for (i = 0; i < MAX_NUMNODES; i++) {
> + /* Start at @node */
> + n = (node + i) % MAX_NUMNODES;
> +
> + if (!nr_cpus_node(n))
> + continue;
> +
> + /* Skip already used nodes */
> + if (test_bit(n, used_nodes))
> + continue;
> +
> + /* Simple min distance search */
> + val = node_distance(node, n);
> +
> + if (val < min_val) {
> + min_val = val;
> + best_node = n;
> + }
> + }
> +
> + set_bit(best_node, used_nodes);
> + return best_node;
> +}
> +
> +/**
> + * sched_domain_node_span - get a cpumask for a node's sched_domain
> + * @node: node whose cpumask we're constructing
> + * @size: number of nodes to include in this span
> + *
> + * Given a node, construct a good cpumask for its sched_domain to span. It
> + * should be one that prevents unnecessary balancing, but also spreads tasks
> + * out optimally.
> + */
> +static cpumask_t sched_domain_node_span(int node)
> +{
> + int i;
> + cpumask_t span, nodemask;
> + DECLARE_BITMAP(used_nodes, MAX_NUMNODES);
> +
> + cpus_clear(span);
> + bitmap_zero(used_nodes, MAX_NUMNODES);
> +
> + nodemask = node_to_cpumask(node);
> + cpus_or(span, span, nodemask);
> + set_bit(node, used_nodes);
> +
> + for (i = 1; i < SD_NODES_PER_DOMAIN; i++) {
> + int next_node = find_next_best_node(node, used_nodes);
> + nodemask = node_to_cpumask(next_node);
> + cpus_or(span, span, nodemask);
> + }
> +
> + return span;
> +}
> +#endif
> +
> +/*
> + * At the moment, CONFIG_SCHED_SMT is never defined, but leave it in so we
> + * can switch it on easily if needed.
> + */
> #ifdef CONFIG_SCHED_SMT
> static DEFINE_PER_CPU(struct sched_domain, cpu_domains);
> static struct sched_group sched_group_cpus[NR_CPUS];
> @@ -5540,44 +5613,28 @@ static int cpu_to_phys_group(int cpu)
> }
>
> #ifdef CONFIG_NUMA
> -
> +/*
> + * The init_sched_build_groups can't handle what we want to do with node
> + * groups, so roll our own. Now each node has its own list of groups which
> + * gets dynamically allocated.
> + */
> static DEFINE_PER_CPU(struct sched_domain, node_domains);
> -static struct sched_group sched_group_nodes[MAX_NUMNODES];
> -static int cpu_to_node_group(int cpu)
> +static struct sched_group *sched_group_nodes[MAX_NUMNODES];
> +
> +static DEFINE_PER_CPU(struct sched_domain, allnodes_domains);
> +static struct sched_group sched_group_allnodes[MAX_NUMNODES];
> +
> +static int cpu_to_allnodes_group(int cpu)
> {
> return cpu_to_node(cpu);
> }
> #endif
>
> -#if defined(CONFIG_SCHED_SMT) && defined(CONFIG_NUMA)
> -/*
> - * The domains setup code relies on siblings not spanning
> - * multiple nodes. Make sure the architecture has a proper
> - * siblings map:
> - */
> -static void check_sibling_maps(void)
> -{
> - int i, j;
> -
> - for_each_online_cpu(i) {
> - for_each_cpu_mask(j, cpu_sibling_map[i]) {
> - if (cpu_to_node(i) != cpu_to_node(j)) {
> - printk(KERN_INFO "warning: CPU %d siblings map "
> - "to different node - isolating "
> - "them.\n", i);
> - cpu_sibling_map[i] = cpumask_of_cpu(i);
> - break;
> - }
> - }
> - }
> -}
> -#endif
> -
> /*
> * Build sched domains for a given set of cpus and attach the sched domains
> * to the individual cpus
> */
> -static void build_sched_domains(const cpumask_t *cpu_map)
> +void build_sched_domains(const cpumask_t *cpu_map)
> {
> int i;
>
> @@ -5592,11 +5649,22 @@ static void build_sched_domains(const cp
> cpus_and(nodemask, nodemask, *cpu_map);
>
> #ifdef CONFIG_NUMA
> + if (num_online_cpus()
> + > SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
> + sd = &per_cpu(allnodes_domains, i);
> + *sd = SD_ALLNODES_INIT;
> + sd->span = *cpu_map;
> + group = cpu_to_allnodes_group(i);
> + sd->groups = &sched_group_allnodes[group];
> + p = sd;
> + } else
> + p = NULL;
> +
> sd = &per_cpu(node_domains, i);
> - group = cpu_to_node_group(i);
> *sd = SD_NODE_INIT;
> - sd->span = *cpu_map;
> - sd->groups = &sched_group_nodes[group];
> + sd->span = sched_domain_node_span(cpu_to_node(i));
> + sd->parent = p;
> + cpus_and(sd->span, sd->span, *cpu_map);
> #endif
>
> p = sd;
> @@ -5621,7 +5689,7 @@ static void build_sched_domains(const cp
>
> #ifdef CONFIG_SCHED_SMT
> /* Set up CPU (sibling) groups */
> - for_each_online_cpu(i) {
> + for_each_cpu_mask(i, *cpu_map) {
> cpumask_t this_sibling_map = cpu_sibling_map[i];
> cpus_and(this_sibling_map, this_sibling_map, *cpu_map);
> if (i != first_cpu(this_sibling_map))
> @@ -5646,8 +5714,74 @@ static void build_sched_domains(const cp
>
> #ifdef CONFIG_NUMA
> /* Set up node groups */
> - init_sched_build_groups(sched_group_nodes, *cpu_map,
> - &cpu_to_node_group);
> + init_sched_build_groups(sched_group_allnodes, *cpu_map,
> + &cpu_to_allnodes_group);
> +
> + for (i = 0; i < MAX_NUMNODES; i++) {
> + /* Set up node groups */
> + struct sched_group *sg, *prev;
> + cpumask_t nodemask = node_to_cpumask(i);
> + cpumask_t domainspan;
> + cpumask_t covered = CPU_MASK_NONE;
> + int j;
> +
> + cpus_and(nodemask, nodemask, *cpu_map);
> + if (cpus_empty(nodemask))
> + continue;
> +
> + domainspan = sched_domain_node_span(i);
> + cpus_and(domainspan, domainspan, *cpu_map);
> +
> + sg = kmalloc(sizeof(struct sched_group), GFP_KERNEL);
> + sched_group_nodes[i] = sg;
> + for_each_cpu_mask(j, nodemask) {
> + struct sched_domain *sd;
> + sd = &per_cpu(node_domains, j);
> + sd->groups = sg;
> + if (sd->groups == NULL) {
> + /* Turn off balancing if we have no groups */
> + sd->flags = 0;
> + }
> + }
> + if (!sg) {
> + printk(KERN_WARNING
> + "Can not alloc domain group for node %d\n", i);
> + continue;
> + }
> + sg->cpu_power = 0;
> + sg->cpumask = nodemask;
> + cpus_or(covered, covered, nodemask);
> + prev = sg;
> +
> + for (j = 0; j < MAX_NUMNODES; j++) {
> + cpumask_t tmp, notcovered;
> + int n = (i + j) % MAX_NUMNODES;
> +
> + cpus_complement(notcovered, covered);
> + cpus_and(tmp, notcovered, *cpu_map);
> + cpus_and(tmp, tmp, domainspan);
> + if (cpus_empty(tmp))
> + break;
> +
> + nodemask = node_to_cpumask(n);
> + cpus_and(tmp, tmp, nodemask);
> + if (cpus_empty(tmp))
> + continue;
> +
> + sg = kmalloc(sizeof(struct sched_group), GFP_KERNEL);
> + if (!sg) {
> + printk(KERN_WARNING
> + "Can not alloc domain group for node %d\n", j);
> + break;
> + }
> + sg->cpu_power = 0;
> + sg->cpumask = tmp;
> + cpus_or(covered, covered, tmp);
> + prev->next = sg;
> + prev = sg;
> + }
> + prev->next = sched_group_nodes[i];
> + }
> #endif
>
> /* Calculate CPU power for physical packages and nodes */
> @@ -5666,14 +5800,46 @@ static void build_sched_domains(const cp
> sd->groups->cpu_power = power;
>
> #ifdef CONFIG_NUMA
> - if (i == first_cpu(sd->groups->cpumask)) {
> - /* Only add "power" once for each physical package. */
> - sd = &per_cpu(node_domains, i);
> - sd->groups->cpu_power += power;
> + sd = &per_cpu(allnodes_domains, i);
> + if (sd->groups) {
> + power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
> + (cpus_weight(sd->groups->cpumask)-1) / 10;
> + sd->groups->cpu_power = power;
> }
> #endif
> }
>
> +#ifdef CONFIG_NUMA
> + for (i = 0; i < MAX_NUMNODES; i++) {
> + struct sched_group *sg = sched_group_nodes[i];
> + int j;
> +
> + if (sg == NULL)
> + continue;
> +next_sg:
> + for_each_cpu_mask(j, sg->cpumask) {
> + struct sched_domain *sd;
> + int power;
> +
> + sd = &per_cpu(phys_domains, j);
> + if (j != first_cpu(sd->groups->cpumask)) {
> + /*
> + * Only add "power" once for each
> + * physical package.
> + */
> + continue;
> + }
> + power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE *
> + (cpus_weight(sd->groups->cpumask)-1) / 10;
> +
> + sg->cpu_power += power;
> + }
> + sg = sg->next;
> + if (sg != sched_group_nodes[i])
> + goto next_sg;
> + }
> +#endif
> +
> /* Attach the domains */
> for_each_cpu_mask(i, *cpu_map) {
> struct sched_domain *sd;
> @@ -5692,13 +5858,10 @@ static void build_sched_domains(const cp
> /*
> * Set up scheduler domains and groups. Callers must hold the hotplug lock.
> */
> -static void arch_init_sched_domains(cpumask_t *cpu_map)
> +static void arch_init_sched_domains(const cpumask_t *cpu_map)
> {
> cpumask_t cpu_default_map;
>
> -#if defined(CONFIG_SCHED_SMT) && defined(CONFIG_NUMA)
> - check_sibling_maps();
> -#endif
> /*
> * Setup mask for cpus without special case scheduling requirements.
> * For now this just excludes isolated cpus, but could be used to
> @@ -5711,10 +5874,29 @@ static void arch_init_sched_domains(cpum
>
> static void arch_destroy_sched_domains(const cpumask_t *cpu_map)
> {
> - /* Do nothing: everything is statically allocated. */
> -}
> +#ifdef CONFIG_NUMA
> + int i;
> + for (i = 0; i < MAX_NUMNODES; i++) {
> + cpumask_t nodemask = node_to_cpumask(i);
> + struct sched_group *oldsg, *sg = sched_group_nodes[i];
> +
> + cpus_and(nodemask, nodemask, *cpu_map);
> + if (cpus_empty(nodemask))
> + continue;
>
> -#endif /* ARCH_HAS_SCHED_DOMAIN */
> + if (sg == NULL)
> + continue;
> + sg = sg->next;
> +next_sg:
> + oldsg = sg;
> + sg = sg->next;
> + kfree(oldsg);
> + if (oldsg != sched_group_nodes[i])
> + goto next_sg;
> + sched_group_nodes[i] = NULL;
> + }
> +#endif
> +}
>
> /*
> * Detach sched domains from a group of cpus specified in cpu_map