Re: [PATCH v4 04/11] sched/fair: rework load_balance

From: Mel Gorman
Date: Thu Oct 31 2019 - 07:40:27 EST


On Thu, Oct 31, 2019 at 12:13:09PM +0100, Vincent Guittot wrote:
> > > > On the last one, spreading tasks evenly across NUMA domains is not
> > > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine
> > > > with 24 logical CPUs per socket, it should not automatically mean that
> > > > one task should move cross-node and I have definitely observed this
> > > > happening. It's probably bad in terms of locality no matter what but it's
> > > > especially bad if the 2 tasks happened to be communicating because then
> > > > load balancing will pull apart the tasks while wake_affine will push
> > > > them together (and potentially NUMA balancing as well). Note that this
> > > > also applies for some IO workloads because, depending on the filesystem,
> > > > the task may be communicating with workqueues (XFS) or a kernel thread
> > > > (ext4 with jbd2).
> > >
> > > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing
> > > still gives guidances with fbq_classify_group/queue.
> >
> > I know the NUMA_BALANCING part is not touched, I'm talking about load
> > balancing across SD_NUMA domains which happens independently of
> > NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to
> > override the load balancer when it moves tasks away from the preferred
> > node.
>
> Yes. this patchset relies on this override for now to prevent moving task away.

Fair enough, netperf hits the corner case where it does not work but
that is also true without your series.

> I agree that additional patches are probably needed to improve load
> balance at NUMA level and I expect that this rework will make it
> simpler to add.
> I just wanted to get the output of some real use cases before defining
> more numa level specific conditions. Some want to spread on there numa
> nodes but other want to keep everything together. The preferred node
> and fbq_classify_group was the only sensible metrics to me when he
> wrote this patchset but changes can be added if they make sense.
>

That's fair. While it was possible to address the case before your
series, it was a hatchet job. If the changelog simply notes that some
special casing may still be required for SD_NUMA but it's outside the
scope of the series, then I'd be happy. At least there is a good chance
then if there is follow-up work that it won't be interpreted as an
attempt to reintroduce hacky heuristics.

> >
> > > But the latter could also take advantage of the new type of group. For
> > > example, what I did in the fix for find_idlest_group : checking
> > > numa_preferred_nid when the group has capacity and keep the task on
> > > preferred node if possible. Similar behavior could also be beneficial
> > > in periodic load_balance case.
> > >
> >
> > And this is the catch -- numa_preferred_nid is not guaranteed to be set at
> > all. NUMA balancing might be disabled, the task may not have been running
> > long enough to pick a preferred NID or NUMA balancing might be unable to
> > pick a preferred NID. The decision to avoid unnecessary migrations across
> > NUMA domains should be made independently of NUMA balancing. The netperf
> > configuration from mmtests is great at illustrating the point because it'll
> > also say what the average local/remote access ratio is. 2 communicating
> > tasks running on an otherwise idle NUMA machine should not have the load
> > balancer move the server to one node and the client to another.
>
> I'm going to make it a try on my setup to see the results
>

Thanks.

--
Mel Gorman
SUSE Labs