Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU

From: Mathieu Desnoyers
Date: Thu Oct 12 2023 - 16:06:44 EST

Next message: Javier Carrasco: "[PATCH 0/2] iio: Add device tree support for silabs,si7005"
Previous message: Krzysztof Kozlowski: "Re: [PATCH] arm64: dts: exynos: Add reserved memory for pstore on E850-96"
In reply to: Vincent Guittot: "Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU"
Next in thread: Chen Yu: "Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2023-10-12 13:00, Vincent Guittot wrote:

On Thu, 12 Oct 2023 at 18:48, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:

On 2023-10-12 12:24, Vincent Guittot wrote:

On Thu, 12 Oct 2023 at 17:56, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:

On 2023-10-12 11:01, Vincent Guittot wrote:

On Thu, 12 Oct 2023 at 16:33, Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:

On 2023-10-11 06:16, Chen Yu wrote:

On 2023-10-10 at 09:49:54 -0400, Mathieu Desnoyers wrote:

On 2023-10-09 01:14, Chen Yu wrote:

On 2023-09-30 at 07:45:38 -0400, Mathieu Desnoyers wrote:

On 9/30/23 03:11, Chen Yu wrote:

Hi Mathieu,

On 2023-09-29 at 14:33:50 -0400, Mathieu Desnoyers wrote:

Introduce the WAKEUP_BIAS_PREV_IDLE scheduler feature. It biases
select_task_rq towards the previous CPU if it was almost idle
(avg_load <= 0.1%).

Yes, this is a promising direction IMO. One question is that,
can cfs_rq->avg.load_avg be used for percentage comparison?
If I understand correctly, load_avg reflects that more than
1 tasks could have been running this runqueue, and the
load_avg is the direct proportion to the load_weight of that
cfs_rq. Besides, LOAD_AVG_MAX seems to not be the max value
that load_avg can reach, it is the sum of
1024 * (y + y^1 + y^2 ... )

For example,
taskset -c 1 nice -n -20 stress -c 1
cat /sys/kernel/debug/sched/debug | grep 'cfs_rq\[1\]' -A 12 | grep "\.load_avg"
.load_avg : 88763
.load_avg : 1024

88763 is higher than LOAD_AVG_MAX=47742

I would have expected the load_avg to be limited to LOAD_AVG_MAX somehow,
but it appears that it does not happen in practice.

That being said, if the cutoff is really at 0.1% or 0.2% of the real max,
does it really matter ?

Maybe the util_avg can be used for precentage comparison I suppose?

[...]

Or
return cpu_util_without(cpu_rq(cpu), p) * 1000 <= capacity_orig_of(cpu) ?

Unfortunately using util_avg does not seem to work based on my testing.
Even at utilization thresholds at 0.1%, 1% and 10%.

Based on comments in fair.c:

* CPU utilization is the sum of running time of runnable tasks plus the
* recent utilization of currently non-runnable tasks on that CPU.

I think we don't want to include currently non-runnable tasks in the
statistics we use, because we are trying to figure out if the cpu is a
idle-enough target based on the tasks which are currently running, for the
purpose of runqueue selection when waking up a task which is considered at
that point in time a non-runnable task on that cpu, and which is about to
become runnable again.

Although LOAD_AVG_MAX is not the max possible load_avg, we still want to find
a proper threshold to decide if the CPU is almost idle. The LOAD_AVG_MAX
based threshold is modified a little bit:

The theory is, if there is only 1 task on the CPU, and that task has a nice
of 0, the task runs 50 us every 1000 us, then this CPU is regarded as almost
idle.

The load_sum of the task is:
50 * (1 + y + y^2 + ... + y^n)
The corresponding avg_load of the task is approximately
NICE_0_WEIGHT * load_sum / LOAD_AVG_MAX = 50.
So:

/* which is close to LOAD_AVG_MAX/1000 = 47 */
#define ALMOST_IDLE_CPU_LOAD 50

Sorry to be slow at understanding this concept, but this whole "load" value
is still somewhat magic to me.

Should it vary based on CONFIG_HZ_{100,250,300,1000}, or is it independent ?
Where is it documented that the load is a value in "us" out of a window of
1000 us ?

My understanding is that, the load_sum of a single task is a value in "us" out
of a window of 1000 us, while the load_avg of the task will multiply the weight
of the task. In this case a task with nice 0 is NICE_0_WEIGHT = 1024.

__update_load_avg_se -> ___update_load_sum calculate the load_sum of a task(there
is comments around ___update_load_sum to describe the pelt calculation),
and ___update_load_avg() calculate the load_avg based on the task's weight.

Thanks for your thorough explanation, now it makes sense.

I understand as well that the cfs_rq->avg.load_sum is the result of summing
each task load_sum multiplied by their weight:

Please don't use load_sum but only *_avg.
As already said, util_avg or runnable_avg are better metrics for you

I think I found out why using util_avg was not working for me.

Considering this comment from cpu_util():

* CPU utilization is the sum of running time of runnable tasks plus the
* recent utilization of currently non-runnable tasks on that CPU.

I don't want to include the recent utilization of currently non-runnable
tasks on that CPU in order to choose that CPU to do task placement in a
context where many tasks were recently running on that cpu (but are
currently blocked). I do not want those blocked tasks to be part of the
avg.

But you have the exact same behavior with load_sum/avg.

So I think the issue here is that I was using the cpu_util() (and
cpu_util_without()) helpers which are considering max(util, runnable),
rather than just "util".

cpu_util_without() only use util_avg but not runnable_avg.

Ah, yes, @boost=0, which prevents it from using the runnable_avg.

Nevertheless, cpu_util_without ans cpu_util uses util_est which is
used to predict the final utilization.

Yes, I suspect it's the util_est which prevents me from getting
performance improvements when I use cpu_util_without to implement
almost-idle.

Let's take the example of task A running 20ms every 200ms on CPU0.
The util_avg of the cpu will vary in the range [7:365]. When task A
wakes up on CPU0, CPU0 util_avg = 7 (below 1%) but taskA will run for
20ms which is not really almost idle. On the other side, CPU0 util_est
will be 365 as soon as task A is enqueued (which will be the value of
CPU0 util_avg just before going idle)

If task A sleeps (becomes non-runnable) without being migrated, and therefore
still have CPU0 as its cpu, is it still considered as part of the util_est of
CPU0 while it is blocked ? If it is the case, then the util_est is preventing

No, the util_est of a cpu only accounts the runnable tasks not the sleeping one

CPU's util_est = /Sum of the util_est of runnable tasks

OK, after further testing, it turns out that cpu_util_without() works for my case now. I'm not sure what I got wrong in my past attempts.

rq selection from considering a rq almost idle when waking up sleeping tasks
due to taking into account the set of sleeping tasks in its utilization estimate.

Let's now take a task B running 100us every 1024us
The util_avg of the cpu should vary in the range [101:103] and once
task B is enqueued, CPU0 util_est will be 103

Based on your comments, just doing this to match a rq util_avg <= 1% (10us of 1024us)

it's not 10us of 1024us

Is the range of util_avg within [0..1024] * capacity_of(cpu), or am I missing something ?

util_avg is in the range [0:1024] and CPU's capacity is in the range
[0:1024] too. 1024 is the compute capacity of the most powerful CPU of
the system. On SMP system, all CPUs have a capacity of 1024. On
heterogeneous system, big core have a capacity of 1024 and others will
have a lower capacity

Sounds good, I will prepare an updated patch.

Thanks,

Mathieu

Thanks,

Mathieu

seems to work fine:

return cpu_rq(cpu)->cfs.avg.util_avg <= 10 * capacity_of(cpu);

Is this approach acceptable ?

Thanks!

Mathieu

static inline void
enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
cfs_rq->avg.load_avg += se->avg.load_avg;
cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
}

Therefore I think we need to multiply the load_sum value we aim for by
get_pelt_divider(&cpu_rq(cpu)->cfs.avg) to compare it to a rq load_sum.

I plan to compare the rq load sum to "10 * get_pelt_divider(&cpu_rq(cpu)->cfs.avg)"
to match runqueues which were previously idle (therefore with prior periods contribution
to the rq->load_sum being pretty much zero), and which have a current period rq load_sum
below or equal 10us per 1024us (<= 1%):

static inline unsigned long cfs_rq_weighted_load_sum(struct cfs_rq *cfs_rq)
{
return cfs_rq->avg.load_sum;
}

static unsigned long cpu_weighted_load_sum(struct rq *rq)
{
return cfs_rq_weighted_load_sum(&rq->cfs);
}

/*
* A runqueue is considered almost idle if:
*
* cfs_rq->avg.load_sum / get_pelt_divider(&cfs_rq->avg) / 1024 <= 1%
*
* This inequality is transformed as follows to minimize arithmetic:
*
* cfs_rq->avg.load_sum <= get_pelt_divider(&cfs_rq->avg) * 10
*/
static bool
almost_idle_cpu(int cpu, struct task_struct *p)
{
if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
return false;
return cpu_weighted_load_sum(cpu_rq(cpu)) <= 10 * get_pelt_divider(&cpu_rq(cpu)->cfs.avg);
}

Does it make sense ?

Thanks,

Mathieu

And with this value "50", it would cover the case where there is only a
single task taking less than 50us per 1000us, and cases where the sum for
the set of tasks on the runqueue is taking less than 50us per 1000us
overall.

static bool
almost_idle_cpu(int cpu, struct task_struct *p)
{
if (!sched_feat(WAKEUP_BIAS_PREV_IDLE))
return false;
return cpu_load_without(cpu_rq(cpu), p) <= ALMOST_IDLE_CPU_LOAD;
}

Tested this on Intel Xeon Platinum 8360Y, Ice Lake server, 36 core/package,
total 72 core/144 CPUs. Slight improvement is observed in hackbench socket mode:

socket mode:
hackbench -g 16 -f 20 -l 480000 -s 100

Before patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 81.084

After patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 78.083

pipe mode:
hackbench -g 16 -f 20 --pipe -l 480000 -s 100

Before patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 38.219

After patch:
Running in process mode with 16 groups using 40 file descriptors each (== 640 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 38.348

It suggests that, if the workload has larger working-set/cache footprint, waking up
the task on its previous CPU could get more benefit.

In those tests, what is the average % of idleness of your cpus ?

For hackbench -g 16 -f 20 --pipe -l 480000 -s 100, it is around 8~10% idle
For hackbench -g 16 -f 20 -l 480000 -s 100, it is around 2~3% idle

Then the CPUs in packge 1 are offlined to get stable result when the group number is low.
hackbench -g 1 -f 20 --pipe -l 480000 -s 100
Some CPUs are busy, others are idle, and some are half-busy.
Core CPU Busy%
- - 49.57
0 0 1.89
0 72 75.55
1 1 100.00
1 73 0.00
2 2 100.00
2 74 0.00
3 3 100.00
3 75 0.01
4 4 78.29
4 76 17.72
5 5 100.00
5 77 0.00

hackbench -g 1 -f 20 -l 480000 -s 100
Core CPU Busy%
- - 48.29
0 0 57.94
0 72 21.41
1 1 83.28
1 73 0.00
2 2 11.44
2 74 83.38
3 3 21.45
3 75 77.27
4 4 26.89
4 76 80.95
5 5 5.01
5 77 83.09

echo NO_WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
hackbench -g 1 -f 20 --pipe -l 480000 -s 100
Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 9.434

echo WAKEUP_BIAS_PREV_IDLE > /sys/kernel/debug/sched/features
hackbench -g 1 -f 20 --pipe -l 480000 -s 100
Running in process mode with 1 groups using 40 file descriptors each (== 40 tasks)
Each sender will pass 480000 messages of 100 bytes
Time: 9.373

thanks,
Chenyu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Next message: Javier Carrasco: "[PATCH 0/2] iio: Add device tree support for silabs,si7005"
Previous message: Krzysztof Kozlowski: "Re: [PATCH] arm64: dts: exynos: Add reserved memory for pstore on E850-96"
In reply to: Vincent Guittot: "Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU"
Next in thread: Chen Yu: "Re: [RFC PATCH] sched/fair: Bias runqueue selection towards almost idle prev CPU"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]