Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2

From: Peter Zijlstra
Date: Wed Jan 08 2020 - 08:19:31 EST


On Tue, Jan 07, 2020 at 08:24:06PM +0000, Mel Gorman wrote:
> Now I get you, but unfortunately it also would not work out. The number
> of groups is not related to the LLC except in some specific cases.
> It's possible to use the first CPU to find the size of an LLC but now I
> worry that it would lead to unpredictable behaviour. AMD has different
> numbers of LLCs per node depending on the CPU family and while Intel
> generally has one LLC per node, I imagine there are counter examples.

Intel has the 'fun' case of an LLC spanning nodes :-), although Linux
pretends this isn't so and truncates the LLC topology information to be
the node again -- see arch/x86/kernel/smpboot.c:match_llc().

And of course, in the Core2 era we had the Core2Quad chips, which were a
dual-die solution and therefore also had multiple LLCs, and I think the
Xeon variant of that would allow the multiple-LLCs-per-node situation
too, although this is of course ancient hardware nobody really cares
about anymore.

> This means that load balancing on different machines with similar core
> counts will behave differently due to the LLC size.

That sounds like perfectly fine/expected behaviour to me.

> It might be possible
> to infer it if the intermediate domain was DIE instead of MC but I doubt
> that's guaranteed and it would still be unpredictable. It may be the type
> of complexity that should only be introduced with a separate patch with
> clear rationale as to why it's necessary and we are not at that threshold
> so I withdraw the suggestion.

So IIRC the initial patch(es) had the idea to allow for 1 extra task
imbalance to get 1-1 pairs on the same node, instead of across nodes. I
don't immediately see that in these later patches.

Would that be something to go back to? Would that not side-step many of
the issues under discussion here?
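
For illustration only, the 1-extra-task idea could look roughly like the
sketch below. This is not taken from any of the posted patches; the
helper name, its arguments and the threshold of 2 running tasks are all
hypothetical, chosen so that a communicating pair can stay on one node
while the other node is idle.

```c
/*
 * Hypothetical threshold (assumption, not from the patches): how many
 * running tasks the busier node may have before the NUMA imbalance is
 * acted upon. 2 lets a 1-1 pair of tasks share a node.
 */
#define NUMA_IMBALANCE_ALLOWED 2

/*
 * Hypothetical helper, not kernel code.
 *
 * imbalance:      load the balancer would otherwise try to move
 * src_nr_running: number of running tasks on the busier node
 *
 * Returns 0 when the busier node is lightly loaded enough that the
 * imbalance should be tolerated, otherwise the imbalance unchanged.
 */
static long numa_adjust_imbalance(long imbalance, unsigned int src_nr_running)
{
	if (src_nr_running <= NUMA_IMBALANCE_ALLOWED)
		return 0;

	return imbalance;
}
```

With these made-up numbers, a node running 2 tasks next to an idle node
reports no imbalance, so the pair is left alone; once the node has more
than 2 running tasks the imbalance is passed through as before.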