Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2

From: Mel Gorman
Date: Wed Jan 08 2020 - 09:03:55 EST


On Wed, Jan 08, 2020 at 02:18:58PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2020 at 08:24:06PM +0000, Mel Gorman wrote:
> > Now I get you, but unfortunately it also would not work out. The number
> > of groups is not related to the LLC except in some specific cases.
> > It's possible to use the first CPU to find the size of an LLC but now I
> > worry that it would lead to unpredictable behaviour. AMD has different
> > numbers of LLCs per node depending on the CPU family and while Intel
> > generally has one LLC per node, I imagine there are counter examples.
>
> Intel has the 'fun' case of an LLC spanning nodes :-), although Linux
> pretends this isn't so and truncates the LLC topology information to be
> the node again -- see arch/x86/kernel/smpboot.c:match_llc().
>

That's Sub-NUMA Clustering, right? I thought it manifested as a
separate NUMA node in Linux. Not exactly right from a topology point of
view, but close enough. Still, you're right that identifying the LLC
specifically would not necessarily be the way to go. FWIW, I have one
machine with SNC enabled and didn't notice anything problematic in the
versions of the patch released to date.

> And of course, in the Core2 era we had the Core2Quad chips, which were a
> dual-die solution and therefore also had multiple LLCs, and I think the
> Xeon variant of that would allow the multiple LLC per node situation
> too, although this is of course ancient hardware nobody really cares
> about anymore.
>

Ok, even though that hardware is older, those are enough counter-examples
to be concerned about. I'm definitely dropping the LLC considerations for
the moment :)

> > This means that load balancing on different machines with similar core
> > counts will behave differently due to the LLC size.
>
> That sounds like perfectly fine/expected behaviour to me.
>

Even so, I think it should not be part of the initial patch. I would
only be happy with it if I had enough different machine types to prove
that such special casing is required and that we cannot simply rely on
standard load balancing of the domains below SD_NUMA.

> > It might be possible
> > to infer it if the intermediate domain was DIE instead of MC but I doubt
> > that's guaranteed and it would still be unpredictable. It may be the type
> > of complexity that should only be introduced with a separate patch with
> > clear rationale as to why it's necessary and we are not at that threshold
> > so I withdraw the suggestion.
>
> So IIRC the initial patch(es) had the idea to allow for 1 extra task
> imbalance to get 1-1 pairs on the same node, instead of across nodes. I
> don't immediately see that in these later patches.
>

Not quite -- I had a minimum allowed imbalance of 2 tasks, mostly for
very small NUMA domains. By v3 it was no longer necessary because the
value was hard-coded regardless of the number of CPUs. I think I'll
leave it out because I don't think it's worth worrying about imbalance
between NUMA domains of fewer than 4 CPUs (do they even exist any more?)
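
For reference, the old idea boiled down to something like the following
standalone sketch (plain userspace C, not the actual kernel code; the
helper name and the fraction-of-span heuristic are only illustrative):

/*
 * Standalone sketch, not kernel code: tolerate a small task imbalance
 * between NUMA nodes, with a floor of 2 tasks so that even a very small
 * NUMA domain can keep a communicating pair plus a kthread local.
 */
#include <stdio.h>
#include <stdbool.h>

#define MIN_ALLOWED_IMBALANCE   2

/* Return true if the imbalance between the two nodes should be left alone */
static bool numa_imbalance_ok(int busiest_nr, int local_nr, int node_cpus)
{
        int allowed = node_cpus / 4;            /* illustrative fraction of span */

        if (allowed < MIN_ALLOWED_IMBALANCE)    /* floor for tiny NUMA domains */
                allowed = MIN_ALLOWED_IMBALANCE;

        return busiest_nr - local_nr <= allowed;
}

int main(void)
{
        /* 3 vs 1 tasks on a 4-CPU node: within the 2-task floor, leave it */
        printf("%d\n", numa_imbalance_ok(3, 1, 4));
        /* 9 vs 1 tasks on a 16-CPU node: well beyond the slack, balance */
        printf("%d\n", numa_imbalance_ok(9, 1, 16));
        return 0;
}

The only part that mattered was the 2-task floor; the scaling with the
domain span is what became unnecessary once the value was hard-coded.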

> Would that be something to go back to? Would that not side-step much of
> the issues under discussion here?

Allowing just 1 extra task would work for netperf in some cases, except
when softirq is involved. It would partially work for IO on ext4, as it
is only communicating with one journal thread, but it is more borderline
for XFS due to workqueue usage. XFS is not a massive concern in this
context as the workqueue is close to the IO issuer and short-lived, so I
don't think it would crop up much for load balancing, unlike ext4 where
jbd2 can be very active.

If v4 of the patch fails to meet approval then I'll try a patch that
allows a hard-coded imbalance of 2 tasks (one communicating task and
one kthread) regardless of NUMA domain span, but only up to 50% of
utilisation, and see what that gets us. One way or the other, I would
like to get this basic load balancing issue out of the way before even
looking at NUMA balancing and whether it needs to be adjusted based on
the load balancing rewrite. NUMA balancing already has some logic that
fights load balancer decisions and I don't want to make that worse. I
would be delighted if we could even delete the migration retry check
that overrides the load balancer, because it always was a bit nasty.
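
In case it helps to see the shape of that fallback, here is another
standalone sketch (again userspace C rather than the proposed kernel
change; the function name and the task-count approximation of
utilisation are mine):

/*
 * Standalone sketch, not the proposed kernel code: a fixed imbalance of
 * two tasks (one communicating task plus one kthread) is tolerated
 * between NUMA nodes, but only while the busiest node is below 50%
 * utilisation, approximated here as runnable tasks versus CPUs.
 */
#include <stdio.h>
#include <stdbool.h>

#define FIXED_NUMA_IMBALANCE    2

static bool leave_numa_imbalance(int busiest_nr, int local_nr, int node_cpus)
{
        /* Over 50% of the node busy: fall back to normal balancing */
        if (busiest_nr * 2 > node_cpus)
                return false;

        return busiest_nr - local_nr <= FIXED_NUMA_IMBALANCE;
}

int main(void)
{
        /* 4 vs 2 tasks on a 16-CPU node: under 50% and within the slack */
        printf("%d\n", leave_numa_imbalance(4, 2, 16));
        /* 10 vs 8 tasks on a 16-CPU node: over 50%, balance as normal */
        printf("%d\n", leave_numa_imbalance(10, 8, 16));
        return 0;
}

The 50% cut-off is there so the fixed slack only applies while the node
still has plenty of idle capacity.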

--
Mel Gorman
SUSE Labs