Re: "Cache" sched domains

From: Peter Zijlstra
Date: Thu Jun 16 2011 - 08:28:18 EST


On Thu, 2011-06-16 at 14:11 +0200, Samuel Thibault wrote:
> Hello,
>
> We have an x86 machine whose sockets look like this in hwloc:
>
> +------------------------------------------------------------------------+
> |Socket P#1                                                              |
> |+----------------------------------------------------------------------+|
> ||L3 (16MB)                                                             ||
> |+----------------------------------------------------------------------+|
> |+----------------------++----------------------++----------------------+|
> ||L2 (3072KB)           ||L2 (3072KB)           ||L2 (3072KB)           ||
> |+----------------------++----------------------++----------------------+|
> |+----------++----------++----------++----------++----------++----------+|
> ||L1 (32KB) ||L1 (32KB) ||L1 (32KB) ||L1 (32KB) ||L1 (32KB) ||L1 (32KB) ||
> |+----------++----------++----------++----------++----------++----------+|
> |+----------++----------++----------++----------++----------++----------+|
> ||Core P#0  ||Core P#1  ||Core P#2  ||Core P#3  ||Core P#4  ||Core P#5  ||
> ||+--------+||+--------+||+--------+||+--------+||+--------+||+--------+||
> |||PU P#0  ||||PU P#4  ||||PU P#8  ||||PU P#12 ||||PU P#16 ||||PU P#20 |||
> ||+--------+||+--------+||+--------+||+--------+||+--------+||+--------+||
> |+----------++----------++----------++----------++----------++----------+|
> +------------------------------------------------------------------------+

Pretty, bonus points for effort there.

> However, Linux does not build sched domains for the pairs of cores
> which share an L2 cache. On s390, IBM added sched domains for books,
> that is, sets of cores which share an L2 cache. The support should
> probably be added in a generic way for all archs thanks to generic cache
> information.

Yeah, sched domain generation is currently somewhat crappy.

I think you'll find you'll get that L2 domain when you enable mc/smt
power savings on !magny-cours due to this particular horror in
arch/x86/kernel/smpboot.c (possibly losing another level due to other
crap and changing scheduler behaviour in ways you might not fancy):

const struct cpumask *cpu_coregroup_mask(int cpu)
{
	struct cpuinfo_x86 *c = &cpu_data(cpu);
	/*
	 * For perf, we return last level cache shared map.
	 * And for power savings, we return cpu_core_map
	 */
	if ((sched_mc_power_savings || sched_smt_power_savings) &&
	    !(cpu_has(c, X86_FEATURE_AMD_DCM)))
		return cpu_core_mask(cpu);
	else
		return cpu_llc_shared_mask(cpu);
}
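
To make the branch easier to poke at outside the kernel, here is a rough
userspace model of it. This is only a sketch: struct cpu_model, cpu_bits
and model_coregroup_mask() are made-up names, the two masks are plain
64-bit bitmaps supplied by the caller, and the amd_dcm field merely
stands in for the X86_FEATURE_AMD_DCM test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel API. */
typedef uint64_t cpu_bits;	/* one bit per CPU, up to 64 CPUs */

struct cpu_model {
	cpu_bits core_mask;	/* stands in for cpu_core_mask(cpu) */
	cpu_bits llc_mask;	/* stands in for cpu_llc_shared_mask(cpu) */
	bool amd_dcm;		/* stands in for X86_FEATURE_AMD_DCM */
};

/* Same branch structure as the cpu_coregroup_mask() quoted above. */
static cpu_bits model_coregroup_mask(const struct cpu_model *c,
				     bool mc_power_savings,
				     bool smt_power_savings)
{
	if ((mc_power_savings || smt_power_savings) && !c->amd_dcm)
		return c->core_mask;
	return c->llc_mask;
}
```

With both power-savings knobs off the model always returns the LLC mask;
with either knob on it returns the core mask unless amd_dcm is set.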

I recently started reworking all that sched_domain crud and we're almost
at the point where we can remove all legacy 'level' crap. That is,
nothing in the scheduler should (and does, last time I checked) depend
on sd->level anymore.

So the current goal is to change sched_domain_topology to not be such a
silly hard coded list of domains, but build that thing dynamically based
on the system topology and set all the SD_flags correctly.
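
As a very rough illustration of what "build it from the topology" could
mean, the sketch below takes the candidate spans for one CPU, ordered
from smallest to largest, and keeps only the levels that actually add
CPUs over the previous one. It is a userspace toy, not kernel code:
struct level_span, cpu_bits and build_levels() are invented for the
example, and it ignores the SD_flags entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel API. */
typedef uint64_t cpu_bits;	/* one bit per CPU, up to 64 CPUs */

/* One candidate level: a label plus the CPUs it spans for a given CPU. */
struct level_span {
	const char *name;
	cpu_bits span;
};

/*
 * Copy candidate levels for 'cpu' (ordered smallest to largest) into
 * out[], skipping every level whose span equals the last span kept.
 * The starting span is the CPU on its own, so a level that contains
 * only that CPU is dropped as well. out[] must have room for n
 * entries. Returns the number of levels kept.
 */
static size_t build_levels(const struct level_span *cand, size_t n,
			   unsigned int cpu, struct level_span *out)
{
	cpu_bits prev = (cpu_bits)1 << cpu;
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		if (cand[i].span == prev)
			continue;	/* adds nothing over the level below */
		out[kept++] = cand[i];
		prev = cand[i].span;
	}
	return kept;
}
```

For PU 0 of the socket in the diagram, and assuming the first L2 covers
PUs 0 and 4, the candidates would be SMT = 0x1, L2 = 0x11, L3 = 0x111111
and package = 0x111111. The function keeps L2 and L3 and drops the SMT
and package levels, since neither adds any CPUs.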

If that is something you're willing to work on, that'd be totally
awesome.