Re: [PATCH v4 04/11] sched/fair: rework load_balance

From: Valentin Schneider
Date: Wed Oct 30 2019 - 12:16:25 EST


On 30/10/2019 16:45, Mel Gorman wrote:
> On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote:
>> The load_balance algorithm contains some heuristics which have become
>> meaningless since the rework of the scheduler's metrics like the
>> introduction of PELT.
>>
>> Furthermore, load is an ill-suited metric for solving certain task
>> placement imbalance scenarios. For instance, in the presence of idle CPUs,
>> we should simply try to get at least one task per CPU, whereas the current
>> load-based algorithm can actually leave idle CPUs alone simply because the
>> load is somewhat balanced. The current algorithm ends up creating
>> artificial and meaningless values like avg_load_per_task, or tweaking the
>> state of a group to make it look overloaded when it is not, in order to
>> try to migrate tasks.
>>
>
> I do not think this is necessarily 100% true. With both the previous
> load-balancing behaviour and the apparent behaviour of this patch, it's
> still possible to pull two communicating tasks apart and across NUMA
> domains when utilisation is low. Specifically, a load difference of less
> than SCHED_CAPACITY_SCALE between NUMA nodes can be enough to migrate a
> task to level out load.
>
> So, load might be ill-suited for some cases but that does not make it
> completely useless either.
>

For sure! What we're trying to get at is that load is much less relevant
at low task counts than at high task counts.

A pathological case for asymmetric systems (big.LITTLE et al.) is spawning as
many tasks as you have CPUs without them all being spread out straight from
wakeup (so you end up with at least one rq coscheduling 2 tasks): the load
balancer might never solve that imbalance, and can leave a CPU idle for the
duration of the workload.

The problem there is that the current LB tries to balance avg_load, which is
screwy when asymmetric capacities are involved.

Take the h960 for instance, which is 4+4 big.LITTLE (the LITTLEs have 462
capacity, the bigs 1024). Say you end up with 3 tasks on the LITTLEs (one CPU
is idle) and 5 tasks on the bigs (no CPU is idle). The tasks are CPU hogs, so
they won't go through new wakeups.

You have something like 3072 load on the LITTLEs vs 5120 on the bigs. The
thing is, a group's avg_load scales inversely with capacity:

  sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
		  sgs->group_capacity;

That means the LITTLEs group ends up with 1702 avg_load, whereas the bigs
group gets 1280 - the LITTLEs group is seen as more "busy" despite having an
idle CPU. This kind of imbalance usually ends up in fix_small_imbalance()
territory, which is random at best.
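If it helps, here's the arithmetic spelled out as a throwaway userspace
program (not kernel code; it just applies the avg_load formula above to the
example, assuming each always-running hog contributes ~1024 load):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024
#define HOG_LOAD		1024 /* ~load of one always-running nice-0 task */

int main(void)
{
	/* h960: 4 LITTLEs at 462 capacity each, 4 bigs at 1024 each. */
	unsigned long little_capacity = 4 * 462;	/* 1848 */
	unsigned long big_capacity    = 4 * 1024;	/* 4096 */

	/* 3 hogs on the LITTLEs (one CPU idle), 5 hogs on the bigs. */
	unsigned long little_load = 3 * HOG_LOAD;	/* 3072 */
	unsigned long big_load    = 5 * HOG_LOAD;	/* 5120 */

	/* avg_load scales inversely with group capacity. */
	printf("LITTLEs avg_load = %lu\n",
	       little_load * SCHED_CAPACITY_SCALE / little_capacity); /* 1702 */
	printf("bigs    avg_load = %lu\n",
	       big_load * SCHED_CAPACITY_SCALE / big_capacity);       /* 1280 */

	return 0;
}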

>> @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> /*
>> * No need to call idle_cpu() if nr_running is not 0
>> */
>> - if (!nr_running && idle_cpu(i))
>> + if (!nr_running && idle_cpu(i)) {
>> sgs->idle_cpus++;
>> + /* Idle cpu can't have misfit task */
>> + continue;
>> + }
>> +
>> + if (local_group)
>> + continue;
>>
>> + /* Check for a misfit task on the cpu */
>> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
>> sgs->group_misfit_task_load < rq->misfit_task_load) {
>> sgs->group_misfit_task_load = rq->misfit_task_load;
>
> So.... why exactly do we not care about misfit tasks on CPUs in the
> local group? I'm not saying you're wrong because you have a clear idea
> on how misfit tasks should be treated but it's very non-obvious just
> from the code.
>

Misfit tasks need to be migrated away to CPUs of higher capacity. We let that
happen through the usual load-balance pull - there's little point in pulling
from ourselves.
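
To make that concrete, here's a toy userspace model of the control flow in the
hunk above (hypothetical names, not the kernel code, and idle detection is
reduced to !nr_running):

#include <stdio.h>
#include <stdbool.h>

struct toy_rq {
	unsigned int nr_running;
	unsigned long misfit_task_load;
};

struct toy_sg_stats {
	unsigned int idle_cpus;
	unsigned long group_misfit_task_load;
};

static void toy_update_sg_stats(struct toy_sg_stats *sgs,
				const struct toy_rq *rqs, int nr_cpus,
				bool local_group, bool asym_capacity)
{
	for (int i = 0; i < nr_cpus; i++) {
		const struct toy_rq *rq = &rqs[i];

		if (!rq->nr_running) {
			sgs->idle_cpus++;
			/* An idle CPU can't be running a misfit task. */
			continue;
		}

		/*
		 * The local group never pulls from itself, so tracking
		 * its misfit load would be wasted work.
		 */
		if (local_group)
			continue;

		if (asym_capacity &&
		    sgs->group_misfit_task_load < rq->misfit_task_load)
			sgs->group_misfit_task_load = rq->misfit_task_load;
	}
}

int main(void)
{
	struct toy_rq rqs[] = {
		{ .nr_running = 0, .misfit_task_load = 0   },
		{ .nr_running = 1, .misfit_task_load = 600 },
	};
	struct toy_sg_stats remote = { 0 }, local = { 0 };

	toy_update_sg_stats(&remote, rqs, 2, false, true);
	toy_update_sg_stats(&local,  rqs, 2, true,  true);

	/* remote: idle=1 misfit=600; local: idle=1 misfit=0 */
	printf("remote: idle=%u misfit=%lu\n",
	       remote.idle_cpus, remote.group_misfit_task_load);
	printf("local:  idle=%u misfit=%lu\n",
	       local.idle_cpus, local.group_misfit_task_load);
	return 0;
}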

> I didn't see anything obvious after this point but I also am getting a
> bit on the fried side trying to hold this entire patch in my head and
> got hung up on the NUMA domain balancing in particular.
>

I share the feeling; I've gone through it in chunks, but I need to have
another go at it.