Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

From: Peter Zijlstra
Date: Mon Oct 02 2017 - 08:44:00 EST



Sorry, lost track of this thing..

On Mon, Aug 14, 2017 at 02:44:59PM +0700, Suravee Suthikulpanit wrote:

> > Why does the degenerate code fail to remove things?
> >
>
> Sorry for the confusion. Actually, the degenerate code does remove the
> duplicate NODE sched-domain.
>
> The difference is that without the !cpumask_equal() check, the MC
> sched-domain now has the SD_PREFER_SIBLING flag set by the degenerate
> code, since the flag gets transferred down from the NODE to the MC
> sched-domain. Would this be the preferred behavior for the MC
> sched-domain?

Right, but the 'normal' x86_topology setup will have SD_PREFER_SIBLING
set on the MC domain by virtue of merging the MC and CPU domains. So the
first layer of NUMA spreads.
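
(For reference, that transfer happens in the degenerate path of
cpu_attach_domain() in kernel/sched/topology.c; roughly, and lightly
trimmed, it looks like this:

	/* Remove the sched domains which do not contribute to scheduling. */
	for (tmp = sd; tmp; ) {
		struct sched_domain *parent = tmp->parent;
		if (!parent)
			break;

		if (sd_parent_degenerate(tmp, parent)) {
			/* Unlink the degenerate parent ... */
			tmp->parent = parent->parent;
			if (parent->parent)
				parent->parent->child = tmp;
			/* ... but pull its SD_PREFER_SIBLING down. */
			if (parent->flags & SD_PREFER_SIBLING)
				tmp->flags |= SD_PREFER_SIBLING;
			destroy_sched_domain(parent);
		} else
			tmp = tmp->parent;
	}

So when a parent domain is folded into its child, the child inherits
the spreading hint.)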

Now, that appears not to be the case for x86_numa_in_package_topology,
since that doesn't do the CPU domain, but your patch would effectively
'fix' that if it were to drop that conditional. It would however keep
PREFER_SIBLING at the NODE level; MC would not get it.

But I'm wondering if we're not doing this all a bit backwards. Having
SD_PREFER_SIBLING on the L3 makes sense in that it will spread the
workload across the machine and maximize cache utilization. But we don't
set it directly; instead we inherit it through merging.
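
(To recap how the hint actually works: when balancing at some domain,
the load balancer looks at the *child* domain's flags, and if
SD_PREFER_SIBLING is set there it lowers a sibling group's capacity so
that the excess tasks get moved away. Roughly, from
update_sd_lb_stats() in kernel/sched/fair.c of this era:

	if (child && child->flags & SD_PREFER_SIBLING)
		prefer_sibling = 1;

	...

		/*
		 * In case the child domain prefers tasks to go to
		 * siblings first, lower the sg capacity so that we'll
		 * try and move all the excess tasks away.
		 */
		if (prefer_sibling && sds->local &&
		    group_has_capacity(env, local) &&
		    (sgs->sum_nr_running > local->sum_nr_running + 1)) {
			sgs->group_no_capacity = 1;
			sgs->group_type = group_classify(sg, sgs);
		}

So a flag set on some domain takes effect when balancing one level up,
spreading across that domain's siblings.)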

So commit 8e7fbcbc22c1 ("sched: Remove stale power aware scheduling
remnants and dysfunctional knobs") made a mistake by not preserving
SD_PREFER_SIBLING for the !power_saving case on both CPU and MC.

Then commit 6956dc568f34 ("sched/numa: Add SD_PERFER_SIBLING to CPU
domain") added it back to the CPU domain, but not to MC.

So I'm thinking it makes sense to restore the explicit MC flag and make
your NODE domain unconditional.

That would make behaviour consistent between the various modes;
something like the below. I'll go try and update your Changelog so we
can get this merged.

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1157,6 +1157,7 @@ sd_init(struct sched_domain_topology_lev
 		sd->smt_gain = 1178; /* ~15% */
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 117;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
@@ -1331,6 +1332,10 @@ void sched_init_numa(void)
 	if (!sched_domains_numa_distance)
 		return;
 
+	/* Includes NUMA identity node at level 0. */
+	sched_domains_numa_distance[level++] = curr_distance;
+	sched_domains_numa_levels = level;
+
 	/*
 	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 	 * unique distances in the node_distance() table.
@@ -1378,8 +1383,7 @@ void sched_init_numa(void)
 		return;
 
 	/*
-	 * 'level' contains the number of unique distances, excluding the
-	 * identity distance node_distance(i,i).
+	 * 'level' contains the number of unique distances
 	 *
 	 * The sched_domains_numa_distance[] array includes the actual distance
 	 * numbers.
@@ -1441,9 +1445,18 @@ void sched_init_numa(void)
 		tl[i] = sched_domain_topology[i];
 
 	/*
+	 * Add the NUMA identity distance, aka single NODE.
+	 */
+	tl[i++] = (struct sched_domain_topology_level){
+		.mask = sd_numa_mask,
+		.numa_level = 0,
+		SD_INIT_NAME(NODE)
+	};
+
+	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
-	for (j = 0; j < level; i++, j++) {
+	for (j = 1; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
 			.sd_flags = cpu_numa_flags,