RE: [RFC] sched/numa: don't move tasks to idle numa nodes while src node has very light load?
From: Song Bao Hua (Barry Song)
Date: Sat Oct 10 2020 - 19:01:08 EST
> -----Original Message-----
> From: Mel Gorman [mailto:mgorman@xxxxxxxxxxxxxxxxxxx]
> Sent: Tuesday, September 8, 2020 12:42 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>; mingo@xxxxxxxxxx;
> peterz@xxxxxxxxxxxxx; juri.lelli@xxxxxxxxxx; vincent.guittot@xxxxxxxxxx;
> dietmar.eggemann@xxxxxxx; bsegall@xxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>; Valentin
> Schneider <valentin.schneider@xxxxxxx>; Phil Auld <pauld@xxxxxxxxxx>;
> Hillf Danton <hdanton@xxxxxxxx>; Ingo Molnar <mingo@xxxxxxxxxx>;
> Linuxarm <linuxarm@xxxxxxxxxx>; Liguozhu (Kenneth)
> <liguozhu@xxxxxxxxxxxxx>
> Subject: Re: [RFC] sched/numa: don't move tasks to idle numa nodes while src
> node has very light load?
>
> On Mon, Sep 07, 2020 at 12:00:10PM +0000, Song Bao Hua (Barry Song)
> wrote:
> > Hi All,
> > In case we have a NUMA system with 4 nodes, with 24 CPUs in each node, and
> > all of the 96 cores are idle.
> > Then we start a process with 4 threads in this totally idle system.
> > Actually, any one of the four NUMA nodes should have enough capability to
> > run the 4 threads while still keeping 20 CPUs idle after that.
> > But right now the existing code in the CFS load balancer will spread the 4
> > threads across multiple nodes.
> > This results in two negative side effects:
> > 1. more NUMA nodes are woken up while they could otherwise save power in the
> > lowest frequency and halt state
> > 2. cache coherency overhead between NUMA nodes
> >
> > A proof-of-concept patch I made to "fix" this issue to some extent is like:
> >
>
> This is similar in concept to a patch that did much the same thing, except
> in adjust_numa_imbalance(). It ended up being great for light loads like
> simple communicating pairs but fell apart for some HPC workloads when
> memory bandwidth requirements increased. Ultimately it was dropped until
Yes. There is a tradeoff between higher memory bandwidth and lower communication
overhead from things like bus latency and cache coherence. The kernel scheduler
actually doesn't know the requirements of applications. It doesn't know whether an
application is sensitive to memory bandwidth or to cache coherence unless the
application tells it through mempolicy APIs such as set_mempolicy() (a small sketch
of that is below).
It seems we could feed perf profiling data into the scheduler as input. If perf finds
that the application needs lots of memory bandwidth, we spread it across more NUMA
nodes. Otherwise, if perf finds the application gets low IPC due to cache coherence
traffic, we try to pack its threads into one NUMA node. Maybe that is too difficult
for the kernel, but if we had a userspace scheduler which calls taskset and numactl
based on perf profiling, it seems such a userspace scheduler could place applications
more precisely based on their characteristics?
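
For illustration only (my own minimal sketch, not from the original patch): an
application that knows it is coherence-sensitive can already hint its placement to
the kernel with the mempolicy API, e.g. by binding future allocations to one node
via set_mempolicy() from libnuma's <numaif.h>. Node 0 here is an arbitrary
placeholder, and a bandwidth-hungry application could use MPOL_INTERLEAVE instead.

/* Minimal sketch: bind future allocations to a single NUMA node.
 * Build with -lnuma. Node 0 is an arbitrary choice for illustration. */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        unsigned long nodemask = 1UL << 0;      /* node 0 only */

        /* MPOL_BIND restricts allocations to the nodes in nodemask;
         * MPOL_INTERLEAVE would spread pages across nodes instead. */
        if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask) + 1)) {
                perror("set_mempolicy");
                return EXIT_FAILURE;
        }

        /* ... spawn threads, allocate and touch memory, etc. ... */
        return 0;
}
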
> the NUMA/CPU load balancing was reconciled, so it may be worth a revisit. At
> the time, it was really problematic once one node was roughly 25% CPU
> utilised on a 2-socket machine with hyper-threading enabled. The patch may
> still work out but it would need wider testing. Within mmtests, the NAS
> workloads for D-class on a 2-socket machine, varying the number of parallel
> tasks/processes used, should be enough to determine if the patch is
> free from side-effects for one machine. It gets problematic for different
> machine sizes as the point where memory bandwidth is saturated varies.
> group_weight/4 might be fine on one machine as a cut-off and a problem
> on a larger machine with more cores -- I hit that particular problem
> when one 2 socket machine with 48 logical CPUs was fine but a different
> machine with 80 logical CPUs regressed.
Different machines have different memory bandwidth and different NUMA topologies.
If it is too tough to figure out one proper value that makes everyone happy, what
do you think about providing a sysctl or boot argument for this so that users can
adjust the cut-off based on their own testing and profiling? A rough sketch of
what I mean is below.
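
To make the idea concrete, here is a rough, userspace-compilable sketch (not the
real kernel code; the function and the sysctl name are hypothetical) of the kind of
tunable cut-off being discussed, defaulting to group_weight/4 when the knob is
unset:

/* Hypothetical illustration only: a simplified stand-in for the cut-off
 * logic discussed above. The real code lives in adjust_numa_imbalance()
 * in kernel/sched/fair.c and differs; the sysctl name is invented. */
static int sysctl_numa_imbalance_cutoff;   /* 0 = default to group_weight / 4 */

static long numa_imbalance_cutoff_sketch(long imbalance, int src_nr_running,
                                         int group_weight)
{
        int cutoff = sysctl_numa_imbalance_cutoff ?
                     sysctl_numa_imbalance_cutoff : group_weight / 4;

        /* If the source node is lightly loaded, tolerate the imbalance so
         * communicating tasks are not spread to idle remote nodes. */
        if (src_nr_running <= cutoff)
                return 0;

        return imbalance;
}
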
>
> I'm not saying the patch is wrong, just that patches in general for this
> area (everyone, not just you) need fairly broad testing.
>
Thanks
Barry