Re: [PATCH] sched/topology: Improve load balancing on AMD EPYC

From: Suthikulpanit, Suravee
Date: Wed Jun 26 2019 - 17:18:46 EST


On 6/24/19 9:24 AM, Mel Gorman wrote:
> On Wed, Jun 19, 2019 at 10:34:37PM +0100, Matt Fleming wrote:
>> On Tue, 18 Jun, at 02:33:18PM, Peter Zijlstra wrote:
>>> On Tue, Jun 18, 2019 at 11:43:19AM +0100, Matt Fleming wrote:
>>>> This works for me under all my tests. Thoughts?
>>>>
>>>> --->8---
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
>>>> index 80a405c2048a..4db4e9e7654b 100644
>>>> --- a/arch/x86/kernel/cpu/amd.c
>>>> +++ b/arch/x86/kernel/cpu/amd.c
>>>> @@ -8,6 +8,7 @@
>>>> #include <linux/sched.h>
>>>> #include <linux/sched/clock.h>
>>>> #include <linux/random.h>
>>>> +#include <linux/topology.h>
>>>> #include <asm/processor.h>
>>>> #include <asm/apic.h>
>>>> #include <asm/cacheinfo.h>
>>>> @@ -824,6 +825,8 @@ static void init_amd_zn(struct cpuinfo_x86 *c)
>>>> {
>>>> set_cpu_cap(c, X86_FEATURE_ZEN);
>>>>
>>>
>>> I'm thinking this deserves a comment. Traditionally the SLIT table held
>>> relative memory latency. So where the identity is 10, 16 would indicate
>>> 1.6 times local latency and 32 would be 3.2 times local.
>>>
>>> Now, even very early on BIOS monkeys went about their business and put
>>> in random values in an attempt to 'tune' the system based on how
>>> $random-os behaved, which is all sorts of fu^Wwrong.
>>>
>>> Now, I suppose my question is; is that 32 Zen puts in an actual relative
>>> memory latency metric, or a random value we somehow have to deal with.
>>> And can we pretty please describe the whole sordid story behind this
>>> 'tunable' somewhere?
>>
>> This is one for the AMD folks. I don't know if the memory latency
>> really is 3.2 times or not, only that that's the value in all the Zen
>> machines I have access to. Even this 2-socket one:
>>
>> node distances:
>> node 0 1
>> 0: 10 32
>> 1: 32 10
>>
>> Tom, Suravee?
>
> Do not consider this an authoritative response but based on what I know
> of the physical topology, it is not unreasonable to use 32 in the SLIT
> table. There is a small latency when accessing another die on the same
> socket (details are generation specific). It's not quite a local access
> but it's not as much as a traditional remote access either (hence 16 being
> the base unit for another die to hint that it's not quite local but not
> quite remote either). 32 is based on accessing a die on a remote socket
> based on the expected performance and latency of the interconnect.
>
> To the best of my knowledge, the magic numbers are reflective of the real
> topology and not just a gamification of the numbers for a random OS. If
> anything, the fact that there is a load balancing issue on Linux would
> indicate that they were not picking random numbers for Linux at least :P
>

We use 16 to designate 1-hop latency (for a different node within the same socket).
For cross-socket access, since the latency is greater, we set the value to 32
(twice the 1-hop value). We were not aware of RECLAIM_DISTANCE at the time.

At this point, it might not be possible to change the SLIT values on
existing platforms out in the field. So, introducing the AMD family17h
quirk as Matt suggested would be a more feasible approach.

Going forward, we will make sure that the SLIT values do not exceed the
standard RECLAIM_DISTANCE (30).
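
For reference, here is a minimal userspace sketch of how the values above
compare against that threshold. It is not kernel code: the macro and function
names are made up for illustration, and it assumes the check is a strict
"greater than" comparison against 30, so a distance of exactly 30 would still
pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Values quoted in this thread. The names are illustrative only and are
 * not actual kernel symbols. */
#define EXAMPLE_SLIT_LOCAL        10  /* identity (local) distance */
#define EXAMPLE_RECLAIM_DISTANCE  30  /* threshold discussed above */

/* Relative latency implied by a SLIT entry, e.g. 16 -> 1.6x local. */
static double example_relative_latency(int distance)
{
	assert(distance >= EXAMPLE_SLIT_LOCAL);
	return (double)distance / EXAMPLE_SLIT_LOCAL;
}

/* Assumption: a node counts as "too remote" only when its distance is
 * strictly greater than the threshold. */
static bool example_exceeds_reclaim_distance(int distance)
{
	return distance > EXAMPLE_RECLAIM_DISTANCE;
}
```

Under that assumption, the 1-hop value of 16 stays below the threshold, while
the cross-socket value of 32 exceeds it, which is why the quirk is needed on
existing platforms.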

Thanks,
Suravee