Re: [RFC] sched/numa: don't move tasks to idle numa nodes while src node has very light load?

From: Vincent Guittot
Date: Mon Sep 07 2020 - 08:45:37 EST


On Mon, 7 Sep 2020 at 14:00, Song Bao Hua (Barry Song)
<song.bao.hua@xxxxxxxxxxxxx> wrote:
>
> Hi All,
> In case we have a numa system with 4 nodes and in each node we have 24 cpus, and all of the 96 cores are idle.
> Then we start a process with 4 threads in this totally idle system.
> Actually any one of the four numa nodes should have enough capability to run the 4 threads while they can still have 20 idle CPUS after that.
> But right now the existing code in CFS load balance will spread the 4 threads to multiple nodes.
> This results in two negative side effects:
> 1. more numa nodes are awaken while they can save power in lowest frequency and halt status
> 2. cache coherency overhead between numa nodes
>
> A proof-of-concept patch I made to "fix" this issue to some extent is like:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a68a05..f671e15 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9068,9 +9068,20 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> }
>
> /* Consider allowing a small imbalance between NUMA groups */
> - if (env->sd->flags & SD_NUMA)
> + if (env->sd->flags & SD_NUMA) {
> + /* if the src group uses only 1/4 capability and dst is idle
> + * don't move task
> + */
> + if (busiest->sum_nr_running <= busiest->group_weight/4 &&
> + local->sum_nr_running == 0) {
> + env->imbalance = 0;
> + return;

Without considering if that makes sense or not, such tests should be
in adjust_numa_imbalance() which is there to decide if it's worth
"fixing" the imbalance between numa node or not.

The default behavior of load balancer is all about spreading tasks.
Then we have 2 NUMA hooks to prevent this to happen if it doesn't make
sense:
-This adjust_numa_imbalance()
-The fbq_type which is used to skip some rqs

Finally, there were several discussions around adjust_numa_imbalance()
when it was introduced and one was how to define how much imbalance is
allowed that will not regress the performance. The conclusion was that
it depends of a lot of inputs about the topology like the number of
CPUs, the number of nodes, the distance between nodes and several
others things. So as a 1st step, it was decided to use the simple and
current implementation.

The 1/4 threshold that you use above may work for some used cases on
your system but will most probably be wrong for others. We must find
something that is not just a heuristic and can work of other system
too



> + }
> env->imbalance = adjust_numa_imbalance(env->imbalance,
> busiest->sum_nr_running);
> + }
>
> return;
> }
>
> And I wrote a simple process with 4 threads to measure the execution time:
>
> #include <stdio.h>
> #include <pthread.h>
> #include <sys/types.h>
>
> struct foo {
> int x;
> int y;
> } f1;
>
> void* thread_fun1(void* param)
> {
> int s = 0;
> for (int i = 0; i < 1000000000; ++i)
> s += f1.x;
> return NULL;
> }
>
> void* thread_fun2(void* param)
> {
> for (int i = 0; i < 1000000000; ++i)
> ++f1.y;
> return NULL;
> }
>
> double difftimeval(const struct timeval *start, const struct timeval *end)
> {
> double d;
> time_t s;
> suseconds_t u;
>
> s = start->tv_sec - end->tv_sec;
> u = start->tv_usec - end->tv_usec;
>
> d = s;
> d += u/1000000.0;
>
> return d;
> }
>
> int main(void)
> {
> pthread_t tid1,tid2,tid3,tid4;
> struct timeval start,end;
>
> gettimeofday(&start, NULL);
>
> pthread_create(&tid1,NULL,thread_fun1,NULL);
> pthread_create(&tid2,NULL,thread_fun2,NULL);
> pthread_create(&tid3,NULL,thread_fun1,NULL);
> pthread_create(&tid4,NULL,thread_fun2,NULL);
>
> pthread_join(tid1,NULL);
> pthread_join(tid2,NULL);
> pthread_join(tid3,NULL);
> pthread_join(tid4,NULL);
>
> gettimeofday(&end, NULL);
>
> printf("execution time:%f\n", difftimeval(&end, &start));
> }
>
> Before the PoC patch, the test result:
> $ ./a.out
> execution time:10.734581
>
> After the PoC patch, the test result:
> $ ./a.out
> execution time:6.775150
>
> The execution time reduces around 30-40% because 4 threads are put in single one numa node.
>
> On the other hand, the patch doesn't have to depend on NUMA, it can also apply to SCHED_MC with some changes. If one CPU can be still idle after they handle all tasks in the system, we maybe not need to wake up the 2nd CPU at all?
>
> I understand this PoC patch could have negative side effect in some corner cases, for example, if the four threads running in one process want more memory bandwidth by running in multiple nodes. But generally speaking, we do a tradeoff between cache locality and better CPU utilization as they are the main concerns. If one process highly depends on memory bandwidth, they may change their mempolicy?
>
> Thanks
> Barry