Re: EEVDF and NUMA balancing

From: Julia Lawall
Date: Mon Dec 18 2023 - 08:59:11 EST


Hello,

I have looked further into the NUMA balancing issue.

The context is that there are 2N threads running on 2N cores; one thread
gets NUMA balanced to the other socket, leaving N+1 threads on one socket
and N-1 threads on the other socket. This condition typically persists
for one or more seconds.

Previously, I reported this on a 4-socket machine, but it can also occur
on a 2-socket machine, with other tests from the NAS benchmark suite
(sp.B, bt.B, etc.).

Since there are N+1 threads on one of the sockets, one would expect load
balancing to kick in quickly and bring some thread back to the socket that
has only N-1 threads. This doesn't happen, though, because most of the
threads have accumulated NUMA effects and thus have a preferred node. So
there is a high chance that an attempt to steal will fail, because both of
the threads on the core being stolen from have a preference for that socket.
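
To illustrate, the refusal comes from the can_migrate_task path, which
treats a task that would be moved away from its preferred node more or
less like a cache-hot task. A simplified sketch of the check (from memory,
not the actual kernel code; can_steal is just an illustrative name) is:

        /*
         * Simplified sketch, not the actual kernel code: a task that
         * would be moved off its preferred NUMA node is only detached
         * once the balance-failure count has grown past cache_nice_tries.
         */
        static int can_steal(struct task_struct *p, struct lb_env *env)
        {
                int degrades = (task_node(p) == p->numa_preferred_nid &&
                                cpu_to_node(env->dst_cpu) != p->numa_preferred_nid);

                if (degrades &&
                    env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
                        return 0;       /* keep the task where it is */

                return 1;
        }

Since, as described below, nr_balance_failed never grows on this path, the
steal keeps being refused.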

At this point, the only hope is active balancing. However, triggering
active balancing requires the following condition in
imbalanced_active_balance to succeed:

        if ((env->migration_type == migrate_task) &&
            (sd->nr_balance_failed > sd->cache_nice_tries+2))

sd->nr_balance_failed does not increase because the core is idle. When a
core is idle, it comes to the load_balance function from schedule() through
newidle_balance. newidle_balance always passes the flag CPU_NEWLY_IDLE,
even if the core has been idle for a long time.
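
For reference, the relevant call in newidle_balance looks roughly like
this (quoting from memory), so the idle type seen by load_balance on this
path is always CPU_NEWLY_IDLE, regardless of how long the core has
actually been idle:

        if (sd->flags & SD_BALANCE_NEWIDLE) {
                /* CPU_NEWLY_IDLE is hard-coded on the newidle path */
                pulled_task = load_balance(this_cpu, this_rq,
                                           sd, CPU_NEWLY_IDLE,
                                           &continue_balancing);
        }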

Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
the core was already idle before the call to schedule() is not enough,
though, because there is also the constraint on the migration type, which
turns out to be (mostly?) migrate_util. Removing the following code from
find_busiest_queue:

        /*
         * Don't try to pull utilization from a CPU with one
         * running task. Whatever its utilization, we will fail
         * detach the task.
         */
        if (nr_running <= 1)
                continue;

and changing the above test to:

        if ((env->migration_type == migrate_task ||
             env->migration_type == migrate_util) &&
            (sd->nr_balance_failed > sd->cache_nice_tries+2))

seems to solve the problem.
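
Concretely, the newidle_balance part of the change is along the following
lines (a rough sketch rather than the exact patch; was_idle stands for
however one detects that the core was already idle before the call to
schedule(), e.g. by checking whether the previous task was the idle task):

        /*
         * Sketch: report CPU_IDLE rather than CPU_NEWLY_IDLE when the
         * core was already idle before schedule() was called, so that a
         * failed balance can increment nr_balance_failed and eventually
         * trigger active balancing. was_idle is a placeholder for that
         * detection.
         */
        enum cpu_idle_type idle_type = was_idle ? CPU_IDLE : CPU_NEWLY_IDLE;

        if (sd->flags & SD_BALANCE_NEWIDLE) {
                pulled_task = load_balance(this_cpu, this_rq,
                                           sd, idle_type,
                                           &continue_balancing);
        }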

I will test this on more applications. But let me know if the above
solution seems completely inappropriate. Maybe it violates some other
constraints.

I have no idea why this problem became more visible with EEVDF. It seems
to have to do with the time slices all turning out to be the same. I got
the same behavior in 6.5 by overriding the timeslice calculation to
always return 1. But I don't see the connection between the timeslice and
the behavior of the idle task.

thanks,
julia