Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

From: Mel Gorman
Date: Mon Jun 24 2019 - 10:24:27 EST


On Wed, Jun 19, 2019 at 10:34:37PM +0100, Matt Fleming wrote:
> On Tue, 18 Jun, at 02:33:18PM, Peter Zijlstra wrote:
> > On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
> > > This works for me under all my tests. Thoughts?
> > >
> > > --->8---
> > >
> > > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > > index 80a405c2048a..4db4e9e7654b 100644
> > > --- a/arch/x86/kernel/cpu/amd.c
> > > +++ b/arch/x86/kernel/cpu/amd.c
> > > @@ -8,6 +8,7 @@
> > > #include <linux/sched.h>
> > > #include <linux/sched/clock.h>
> > > #include <linux/random.h>
> > > +#include <linux/topology.h>
> > > #include <asm/processor.h>
> > > #include <asm/apic.h>
> > > #include <asm/cacheinfo.h>
> > > @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
> > > {
> > > set_cpu_cap(c, X86_FEATURE_ZEN);
> > >
> >
> > I'm thinking this deserves a comment. Traditionally the SLIT table held
> > relative memory latency. So where the identity is 10, 16 would indicate
> > 1.6 times local latency and 32 would be 3.2 times local.
> >
> > Now, even very early on BIOS monkeys went about their business and put
> > in random values in an attempt to 'tune' the system based on how
> > $random-os behaved, which is all sorts of fu^Wwrong.
> >
> > Now, I suppose my question is; is that 32 Zen puts in an actual relative
> > memory latency metric, or a random value we somehow have to deal with.
> > And can we pretty please describe the whole sordid story behind this
> > 'tunable' somewhere?
>
> This is one for the AMD folks. I don't know if the memory latency
> really is 3.2 times or not, only that that's the value in all the Zen
> machines I have access to. Even this 2-socket one:
>
> node distances:
> node 0 1
> 0: 10 32
> 1: 32 10
>
> Tom, Suravee?

Do not consider this an authorative response but based on what I know
of the physical topology, it is not unreasonable to use 32 in the SLIT
table. There is a small latency when accessing another die on the same
socket (details are generation specific). It's not quite a local access
but it's not as much as a traditional remote access either (hence 16 being
the base unit for another die to hint that it's not quite local but not
quite remote either). 32 is based on accessing a die on a remote socket
based on the expected performance and latency of the interconnect.

To the best of my knowledge, the magic numbers are reflective of the real
topology and not just a gamification of the numbers for a random OS. If
anything, the fact that there is a load balancing issue on Linux would
indicate that they were not picking random numbers for Linux at least :P

--
Mel Gorman
SUSE Labs