Re: [PATCH] sched: fix constructing the span cpu mask of sched domain

From: Hillf Danton
Date: Wed May 11 2011 - 12:07:09 EST


On Tue, May 10, 2011 at 4:32 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> If you're interested in this area of the scheduler, you might want to
> have a poke at:
>
> http://marc.info/?l=linux-kernel&m=130218515520540
>
> That tries to rewrite the CONFIG_NUMA support for the sched_domain stuff
> to create domains based on the node_distance() to better reflect the
> actual machine topology.
>
> As stated, that patch is currently very broken, mostly because the
> topologies encountered don't map to non-overlapping trees. I've not yet
> come up with how to deal with that, but we sure need to do something
> like that, the current group 16 nodes and a group of all simply doesn't
> work well for today's machines now that NUMA is both common and the
> inter-node latencies are more relevant.
>

Hi Peter

Your rewrite of the NUMA support, published at
http://marc.info/?l=linux-kernel&m=130218515520540
is patched below in two ways: the level computation is changed, and so
is the way the recorded levels are used to build the span masks.

When computing levels, your version can lose some valid ones; the
replacement instead inserts every distinct node_distance(0, j) value
into a sorted array, so no distance goes unrecorded. A sketch follows.
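
To make that concrete, here is a minimal user-space sketch of the fixed
level computation; node_distance() is mocked by a hypothetical 4-node
table, so this only illustrates the technique, it is not kernel code:

#include <stdio.h>

#define NR_NODES 4

/* hypothetical distances from node 0: 10 local, 20 and 31 remote */
static int dist[NR_NODES] = { 10, 20, 31, 20 };

static int node_distance(int from, int to)
{
	return dist[to];	/* only distances from node 0 are used */
}

int main(void)
{
	int levels[NR_NODES];
	int level = 0;
	int i, j, k;

	for (j = 0; j < NR_NODES; j++) {
		int distance = node_distance(0, j);

		for (i = 0; i < level; i++) {
			if (distance == levels[i])
				goto next_node;	/* already recorded */
			if (distance < levels[i])
				break;		/* insert position found */
		}
		/* shift larger distances up, keeping the array sorted */
		for (k = level - 1; k >= i; k--)
			levels[k + 1] = levels[k];
		levels[i] = distance;
		level++;
next_node:
		;
	}

	for (i = 0; i < level; i++)
		printf("level %d: distance %d\n", i, levels[i]);
	return 0;
}

With the table above this prints the three distinct distances 10, 20
and 31 in order; the duplicate 20 is recorded once, and no distance is
lost however the nodes are ordered.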

When building a mask, a node is now selected only if its distance
matches that level's distance exactly, so nodes at smaller distances
are masked out as well; this is safe because the level computation is
now exhaustive and records every distinct distance.
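
And a matching sketch of the mask-building rule after the fix: for each
level i, the mask of a node holds exactly the nodes whose distance from
it equals the i-th recorded distance. Plain bitmasks stand in for
struct cpumask, and the symmetric distance table is again hypothetical:

#include <stdio.h>

#define NR_NODES 4

/* hypothetical symmetric distance table for a 4-node ring */
static int dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 31, 20 },
	{ 20, 10, 20, 31 },
	{ 31, 20, 10, 20 },
	{ 20, 31, 20, 10 },
};

int main(void)
{
	int levels[] = { 10, 20, 31 };	/* output of the level pass */
	int i, j, k;

	for (i = 0; i < 3; i++) {
		for (j = 0; j < NR_NODES; j++) {	/* j: origin node */
			unsigned int mask = 0;		/* cleared up front */

			for (k = 0; k < NR_NODES; k++) {
				if (dist[j][k] != levels[i])
					continue;	/* exact match only */
				mask |= 1u << k;
			}
			printf("level %d node %d mask 0x%x\n", i, j, mask);
		}
	}
	return 0;
}

With the old ">" test a level's mask accumulated every node at that
distance or closer; with "!=" each level holds one distance only, and
the patch also clears the mask before setting any bits.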

Lacking NUMA hardware, I was not able to test the patch :(

Hillf
---

--- numa_by_peter.c 2011-05-11 20:22:10.000000000 +0800
+++ numa_by_hillf.c 2011-05-11 21:06:26.000000000 +0800
@@ -1,6 +1,5 @@
static void sched_init_numa(void)
{
- int next_distance, curr_distance = node_distance(0, 0);
struct sched_domain_topology_level *tl;
int level = 0;
int i, j, k;
@@ -11,21 +10,34 @@ static void sched_init_numa(void)
if (!sched_domains_numa_distance)
return;

- next_distance = curr_distance;
- for (i = 0; i < nr_node_ids; i++) {
- for (j = 0; j < nr_node_ids; j++) {
- int distance = node_distance(0, j);
- printk("distance(0,%d): %d\n", j, distance);
- if (distance > curr_distance &&
- (distance < next_distance ||
- next_distance == curr_distance))
- next_distance = distance;
+ for (j = 0; j < nr_node_ids; j++) {
+ int distance = node_distance(0, j);
+ printk("distance(0,%d): %d\n", j, distance);
+ if (j == 0) {
+ sched_domains_numa_distance[j] = distance;
+ sched_domains_numa_levels = ++level;
+ continue;
}
- if (next_distance != curr_distance) {
- sched_domains_numa_distance[level++] = next_distance;
+ for (i = 0; i < level; i++) {
+ /* check if already exist */
+ if (distance == sched_domains_numa_distance[i])
+ goto next_node;
+ /* sort and insert it */
+ if (distance < sched_domains_numa_distance[i])
+ break;
+ }
+ if (i == level) {
+ sched_domains_numa_distance[level++] = distance;
sched_domains_numa_levels = level;
- curr_distance = next_distance;
- } else break;
+ continue;
+ }
+ for (k = level - 1; k >= i; k--)
+ sched_domains_numa_distance[k+1] =
+ sched_domains_numa_distance[k];
+ sched_domains_numa_distance[i] = distance;
+ sched_domains_numa_levels = ++level;
+next_node:
+ ;
}

sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
@@ -44,8 +56,9 @@ static void sched_init_numa(void)
struct cpumask *mask =
per_cpu_ptr(sched_domains_numa_masks[i], j);

+ cpumask_clear(mask);
for (k = 0; k < nr_node_ids; k++) {
- if (node_distance(cpu_to_node(j), k) >
+ if (node_distance(cpu_to_node(j), k) !=
sched_domains_numa_distance[i])
continue;
--