Re: EEVDF and NUMA balancing

From: Vincent Guittot
Date: Mon Dec 18 2023 - 12:18:58 EST


On Mon, 18 Dec 2023 at 14:58, Julia Lawall <julia.lawall@xxxxxxxx> wrote:
>
> Hello,
>
> I have looked further into the NUMA balancing issue.
>
> The context is that there are 2N threads running on 2N cores; one thread
> gets NUMA-balanced to the other socket, leaving N+1 threads on one socket
> and N-1 threads on the other. This condition typically persists for one
> or more seconds.
>
> Previously, I reported this on a 4-socket machine, but it can also occur
> on a 2-socket machine, with other tests from the NAS benchmark suite
> (sp.B, bt.B, etc).
>
> Since there are N+1 threads on one of the sockets, it would seem that load
> balancing would quickly kick in to bring some thread back to the socket
> with only N-1 threads. This doesn't happen, though, because most of the
> threads have NUMA effects that give them a preferred node. So there is a
> high chance that an attempt to steal will fail, because both threads on
> the doubly loaded core have a preference for that socket.
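>
> For reference, the steal is refused in can_migrate_task(), which treats a
> task whose preferred node is the source node as "hot" and only overrides
> that once enough balance failures have been recorded. Roughly (from
> kernel/sched/fair.c, details vary by version):
>
>         tsk_cache_hot = migrate_degrades_locality(p, env);
>         if (tsk_cache_hot == -1)
>                 tsk_cache_hot = task_hot(p, env);
>
>         if (tsk_cache_hot <= 0 ||
>             env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
>                 if (tsk_cache_hot == 1)
>                         schedstat_inc(env->sd->lb_hot_gained[env->idle]);
>                 return 1;
>         }
>
>         schedstat_inc(p->stats.nr_failed_migrations_hot);
>         return 0;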
>
> At this point, the only hope is active balancing. However, triggering
> active balancing requires the success of the following condition in
> imbalanced_active_balance:
>
>         if ((env->migration_type == migrate_task) &&
>             (sd->nr_balance_failed > sd->cache_nice_tries+2))
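>
> For context, that test lives in imbalanced_active_balance(), which
> need_active_balance() consults; roughly (kernel/sched/fair.c, may differ
> slightly by version):
>
>         static inline bool
>         imbalanced_active_balance(struct lb_env *env)
>         {
>                 struct sched_domain *sd = env->sd;
>
>                 /*
>                  * The imbalanced case includes the case of pinned tasks
>                  * preventing a fair distribution of the load on the
>                  * system but also the even distribution of the threads
>                  * on a system with spare capacity
>                  */
>                 if ((env->migration_type == migrate_task) &&
>                     (sd->nr_balance_failed > sd->cache_nice_tries+2))
>                         return 1;
>
>                 return 0;
>         }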
>
> sd->nr_balance_failed does not increase because the core is idle. When a
> core is idle, it comes to the load_balance function from schedule() through
> newidle_balance. newidle_balance always sends in the flag CPU_NEWLY_IDLE,
> even if the core has been idle for a long time.
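>
> The relevant code in load_balance() is roughly (kernel/sched/fair.c):
>
>         if (!ld_moved) {
>                 schedstat_inc(sd->lb_failed[idle]);
>                 /*
>                  * Increment the failure counter only on periodic balance.
>                  * We do not want newidle balance, which can be very
>                  * frequent, pollute the failure counter causing
>                  * excessive cache_hot migrations and active balances.
>                  */
>                 if (idle != CPU_NEWLY_IDLE)
>                         sd->nr_balance_failed++;
>                 ...
>         }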

Do you mean that you never kick a normal idle load balance?

>
> Changing newidle_balance to use CPU_IDLE rather than CPU_NEWLY_IDLE when
> the core was already idle before the call to schedule() is not enough
> though, because there is also the constraint on the migration type. That
> turns out to be (mostly?) migrate_util. Removing the following
> code from find_busiest_queue:
>
>         /*
>          * Don't try to pull utilization from a CPU with one
>          * running task. Whatever its utilization, we will fail
>          * detach the task.
>          */
>         if (nr_running <= 1)
>                 continue;

I'm surprised that load_balance wants to "migrate_util" instead of
"migrate_task"

You have N+1 threads on a group of 2N CPUs, so you should have at most
1 thread per CPU in your busiest group. In theory you should have the
local "group_has_spare" and the busiest "group_fully_busy" (at most).
This means that no group should be overloaded and load_balance should
not try to migrate util but only tasks.
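
For reference, that selection is made in calculate_imbalance(); roughly
(paraphrased, details vary by kernel version):

        if (local->group_type == group_has_spare) {
                if ((busiest->group_type > group_fully_busy) &&
                    !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
                        /*
                         * Busiest is overloaded: fill the spare capacity
                         * of the local group with utilization.
                         */
                        env->migration_type = migrate_util;
                        ...
                } else {
                        /* otherwise, even out the number of tasks */
                        env->migration_type = migrate_task;
                        ...
                }
        }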


>
> and changing the above test to:
>
>         if ((env->migration_type == migrate_task || env->migration_type == migrate_util) &&
>             (sd->nr_balance_failed > sd->cache_nice_tries+2))
>
> seems to solve the problem.
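>
> For the newidle_balance() part, the sketch is roughly the following, where
> "was_idle" is only an illustrative stand-in for "this CPU was already idle
> before the call to schedule()", not an existing field:
>
>         /* in newidle_balance(), when calling load_balance(): */
>         enum cpu_idle_type idle = this_rq->was_idle ? CPU_IDLE
>                                                     : CPU_NEWLY_IDLE;
>         ...
>         pulled_task = load_balance(this_cpu, this_rq, sd, idle,
>                                    &continue_balancing);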
>
> I will test this on more applications. But let me know if the above
> solution seems completely inappropriate. Maybe it violates some other
> constraints.
>
> I have no idea why this problem became more visible with EEVDF. It seems
> to have to do with the time slices all turning out to be the same. I got
> the same behavior in 6.5 by overwriting the timeslice calculation to
> always return 1. But I don't see the connection between the timeslice and
> the behavior of the idle task.
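>
> (The 6.5 hack was along these lines, sched_slice() being the CFS slice
> computation there; sketch only:)
>
>         /* kernel/sched/fair.c, v6.5 (CFS) */
>         static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>         {
>                 return 1;       /* force the same (minimal) slice everywhere */
>         }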
>
> thanks,
> julia