Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU

From: Mathieu Desnoyers
Date: Tue Oct 10 2023 - 09:49:58 EST


On 2023-10-09 01:14, Chen Yu wrote:
On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote:
On 9/30/23 03:11, Chen Yu wrote:
Hi Mathieu,

On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote:
Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases
select_task_rq towards the previous CPU if it was almost idle
(avg_load <= 0.1%).

Yes, this is a promising direction IMO. One question:
can cfs_rq->avg.load_avg be used for a percentage comparison?
If I understand correctly, load_avg reflects that more than
one task could have been running on this runqueue, and
load_avg is directly proportional to the load_weight of that
cfs_rq. Besides, LOAD_AVG_MAX does not seem to be the max value
that load_avg can reach; it is the sum of
1024 * (1 + y + y^2 + ...)

For example,
taskset -c 1 nice -n -20 stress -c 1
cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg"
.load_avg : 88763
.load_avg : 1024

88763 is higher than LOAD_AVG_MAX=47742

I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow,
but it appears that it does not happen in practice.

That being said, if the cutoff is really at 0.1% or 0.2% of the real max,
does it really matter ?
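
To make the numbers above concrete, here is a minimal user-space sketch (not kernel code) of the geometric series behind LOAD_AVG_MAX, and of why a heavily weighted task can end up above it. It assumes the PELT decay factor y satisfies y^32 = 0.5 and that a nice -20 task has weight 88761, as in the kernel's sched_prio_to_weight table. The helper names pelt_series_max() and steady_load_avg() are made up for this illustration.

```c
/* User-space model only; neither helper exists in the kernel. */

/* Assumed PELT decay factor: y^32 = 0.5, so y is roughly 0.5^(1/32). */
#define PELT_Y 0.9785720621

/* 1024 * (1 + y + y^2 + ... + y^(periods - 1)), in floating point. */
static double pelt_series_max(int periods)
{
	double sum = 0.0, term = 1024.0;
	int i;

	for (i = 0; i < periods; i++) {
		sum += term;
		term *= PELT_Y;
	}
	return sum;
}

/*
 * Idealized steady state: load_avg = weight * load_sum / LOAD_AVG_MAX,
 * and load_sum tends to running_frac * LOAD_AVG_MAX, so the series
 * cancels out and only the weight and the running fraction remain.
 */
static double steady_load_avg(unsigned long weight, double running_frac)
{
	return (double)weight * running_frac;
}
```

In this model pelt_series_max() converges to about 47788 for a large number of periods; the kernel constant 47742 is slightly lower, presumably because the kernel computes it in fixed point. steady_load_avg(88761, 1.0) gives 88761, close to the 88763 observed above, which is consistent with load_avg scaling with the task weight rather than being capped at LOAD_AVG_MAX.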

Maybe util_avg can be used for percentage comparison, I suppose?
[...]
Or
return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ?

Unfortunately, using util_avg does not seem to work based on my testing,
even with utilization thresholds of 0.1%, 1% and 10%.

Based on comments in fair.c:

* CPU utilization is the sum of running time of runnable tasks plus the
* recent utilization of currently non-runnable tasks on that CPU.

I think we don't want to include currently non-runnable tasks in the
statistics we use. We are trying to figure out whether the cpu is an
idle-enough target based on the tasks which are currently running, for the
purpose of runqueue selection when waking up a task. At that point in time,
the task being woken up is itself a non-runnable task on that cpu, and it is
about to become runnable again.


Although LOAD_AVG_MAX is not the max possible load_avg, we still want to find
a proper threshold to decide if the CPU is almost idle. The LOAD_AVG_MAX
based threshold is modified a little bit:

The theory is: if there is only one task on the CPU, that task has a nice
value of 0, and it runs 50 us every 1000 us, then this CPU is regarded as
almost idle.

The load_sum of the task is:
50 * (1 + y + y^2 + ... + y^n)
The corresponding load_avg of the task is approximately
NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50.
So:

/* which is close to LOAD_AVG_MAX/1000 = 47 */
#define ALMOST_IDLE_CPU_LOAD 50
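
The arithmetic above can be checked with a small user-space model. This is only a sketch under stated assumptions: it treats each accounting period as one step, glosses over the difference between 1000 us and the 1024 us period, uses the LOAD_AVG_MAX value quoted in this thread, assumes a nice-0 weight of 1024 and a decay factor with y^32 = 0.5, and the name model_load_avg() is hypothetical.

```c
/* User-space model only; a simplified picture of PELT, not kernel code. */

#define MODEL_LOAD_AVG_MAX	47742.0		/* value quoted in this thread */
#define MODEL_NICE_0_WEIGHT	1024.0		/* assumed weight of a nice-0 task */
#define MODEL_PELT_Y		0.9785720621	/* assumed decay, y^32 ~= 0.5 */

/*
 * Add run_us_per_period of runtime in every period, decaying the
 * contribution of older periods by y, then scale the resulting
 * load_sum as in the formula above.
 */
static double model_load_avg(double run_us_per_period, int periods)
{
	double load_sum = 0.0;
	int i;

	for (i = 0; i < periods; i++)
		load_sum = load_sum * MODEL_PELT_Y + run_us_per_period;

	return MODEL_NICE_0_WEIGHT * load_sum / MODEL_LOAD_AVG_MAX;
}
```

With 50 us of runtime per period and enough periods for the sum to settle, this model yields a value very close to 50, which matches the proposed ALMOST_IDLE_CPU_LOAD.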

Sorry to be slow at understanding this concept, but this whole "load" value is still somewhat magic to me.

Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it independent ? Where is it documented that the load is a value in "us" out of a window of 1000 us ?

And with this value "50", it would cover the case where there is only a single task taking less than 50us per 1000us, and cases where the sum for the set of tasks on the runqueue is taking less than 50us per 1000us overall.


static bool
almost_idle_cpu(int cpu, struct task_struct *p)
{
	if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
		return false;
	return cpu_load_without(cpu_rq(cpu), p) <= ALMOST_IDLE_CPU_LOAD;
}
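
As an illustration of what this check decides, the following user-space sketch models the comparison with hypothetical stand-in structures. It assumes that cpu_load_without() amounts to "runqueue load minus the contribution of p when p's load is accounted on that runqueue, clamped at zero"; the model_* names and fields are invented here and are not kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel structures; illustration only. */
struct model_task {
	int cpu;			/* CPU whose runqueue carries this task's load */
	unsigned long load_avg;		/* the task's own contribution */
};

struct model_rq {
	int cpu;
	unsigned long load_avg;		/* sum over all tasks attached to this runqueue */
};

#define MODEL_ALMOST_IDLE_CPU_LOAD 50UL

/* Runqueue load with p's own contribution removed, clamped at zero. */
static unsigned long model_load_without(const struct model_rq *rq,
					const struct model_task *p)
{
	if (p->cpu != rq->cpu)
		return rq->load_avg;
	return rq->load_avg > p->load_avg ? rq->load_avg - p->load_avg : 0;
}

/* True when everything except p adds up to at most the threshold. */
static bool model_almost_idle_cpu(const struct model_rq *rq,
				  const struct model_task *p)
{
	return model_load_without(rq, p) <= MODEL_ALMOST_IDLE_CPU_LOAD;
}
```

In this model, a runqueue with load 1060 whose waking task contributes 1024 leaves 36 for everything else and counts as almost idle, whether that remainder comes from one light task or from several lighter ones.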

Tested this on an Intel Xeon Platinum 8360Y, Ice Lake server, 36 cores/package,
72 cores/144 CPUs in total. A slight improvement is observed in hackbench socket mode:

socket mode:
hackbench -g 16 -f 20 -l 480000 -s 100

Before patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 81.084

After patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 78.083


pipe mode:
hackbench -g 16 -f 20 --pipe -l 480000 -s 100

Before patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 38.219

After patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 38.348

This suggests that if the workload has a larger working set/cache footprint, waking up
the task on its previous CPU could bring more benefit.
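
For reference, the relative change behind the two hackbench results above can be computed as follows. The helper name percent_change() is made up for this note, and only a single run per configuration is reported here, so the run-to-run noise is unknown.

```c
/* Relative change in percent; a negative value means "after" was faster. */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the times reported above, percent_change(81.084, 78.083) is roughly -3.7 (socket mode finishes about 3.7% sooner), while percent_change(38.219, 38.348) is roughly +0.3 (pipe mode is essentially unchanged, marginally slower).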

In those tests, what is the average % of idleness of your cpus ?

Thanks,

Mathieu


thanks,
Chenyu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com