Re: [PATCH] x86: Consider multiple nodes in a single socket to be "sane"

From: Peter Zijlstra
Date: Mon Sep 15 2014 - 23:29:40 EST


On Mon, Sep 15, 2014 at 03:26:41PM -0700, Dave Hansen wrote:
>
> I'm getting the spew below when booting with Haswell (Xeon
> E5-2699) CPUs and the "Cluster-on-Die" (CoD) feature enabled in
> the BIOS.

What is that cluster-on-die thing? I've heard it before but never could
find anything on it.

> This also fixes sysfs because CPUs with the same 'physical_package_id'
> in /sys/devices/system/cpu/cpu*/topology/ are not listed together
> in the same 'core_siblings_list'. This violates a statement from
> Documentation/ABI/testing/sysfs-devices-system-cpu:
>
> core_siblings: internal kernel map of cpu#'s hardware threads
> within the same physical_package_id.
>
> core_siblings_list: human-readable list of the logical CPU
> numbers within the same physical_package_id as cpu#.

No that statement is wrong; it assumes physical_package_id is a good
identifier for nodes. Clearly this is no longer true.

The idea is that core_siblings (or rather cpu_core_mask) is a mask of
all cores on a node.

> The sysfs effects here cause an issue with the hwloc tool where
> it gets confused and thinks there are more sockets than are
> physically present.

Meh, so then we need another mask.

The important bit you didn't show was the scheduler domain setup. I
suspect it all works by accident, not by design.

> diff -puN arch/x86/kernel/smpboot.c~hsw-cod-is-sane arch/x86/kernel/smpboot.c
> --- a/arch/x86/kernel/smpboot.c~hsw-cod-is-sane 2014-09-15 14:56:20.012314468 -0700
> +++ b/arch/x86/kernel/smpboot.c 2014-09-15 14:58:58.837506644 -0700
> @@ -344,10 +344,13 @@ static bool match_llc(struct cpuinfo_x86
> static bool match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> {
> if (c->phys_proc_id == o->phys_proc_id) {
> - if (cpu_has(c, X86_FEATURE_AMD_DCM))
> - return true;
> -
> - return topology_sane(c, o, "mc");
> + /*
> + * We used to enforce that 'c' and 'o' be on the
> + * same node, but AMD's DCM and Intel's Cluster-
> + * on-Die (CoD) support both have physical
> + * processors that span NUMA nodes.
> + */
> + return true;
> }
> return false;
> }

This is wrong (and I suppose the AMD case was already wrong). That
function is supposed to match a multi-core group which is very much
supposed to be smaller-or-equal to a node, not spanning nodes.

The scheduler assumes: SMT <= LLC <= MC <= NODE and if setting the MC
mask to cover multiple nodes works, its by accident.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/