Re: [PATCH 0/5] sched/fair: Rework EAS to handle more cases

From: Pierre Gondois
Date: Fri Nov 08 2024 - 08:11:06 EST




On 11/8/24 10:27, Vincent Guittot wrote:
Hi Pierre,

On Thu, 7 Nov 2024 at 11:14, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:

Hello Vincent,
Related to feec(), but not to this patchset, I think there might be a
concurrency issue while running feec().

yes, this is a known limitation


Feec() doesn't have any locking mechanism. This means that multiple CPUs
might run the function at the same time.

this is done on purpose as we don't want to lock and slow down the wakeup path

Yes right, this is understandable. However, there could be a way to bail out of
feec() when such a case is detected, without actually waiting for a lock (cf. the
prototype at the end of this mail).
We already bail out of feec() when, in the energy computation, the utilization
of a CPU without the task is higher than with the task.
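
For reference, that existing bail-out looks something like this in mainline
(quoting from memory, so the exact code may differ):

	cur_delta = compute_energy(&eenv, pd, cpus, p, max_spare_cap_cpu);
	/* CPU utilization has changed */
	if (cur_delta < base_energy)
		goto unlock;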


If:
- 2 tasks with approximately the same utilization wake up at the same time
- some space on an energy-efficient CPU is available
feec() will likely select the same target for both tasks.

yes


Once feec() has determined a target for a task, the util signals are only
updated later, in enqueue_task_fair(). The delay between running feec() and
enqueue_task_fair() is ~20us (on a Pixel6). This is not much, but it still allows some other

20us is quite long. Is this the worst case, on a little core at the lowest freq?

I only kept the occurrences where feec() ends up with a target != prev_cpu.
In these cases, the enqueue is done on the target CPU (cf.
__ttwu_queue_wakelist(): the enqueue is then done by the target CPU itself,
from sched_ttwu_pending()), which might take more time.

In the other case, the delta is effectively lower (~10us).


CPUs to run feec() with util signals that will be wrong in no time.

Note that it is also possible for one CPU to run feec() for 2 different tasks,
decide to migrate the 2 tasks to another target CPU, and only then start
enqueueing them. Meaning that one single CPU will run feec() using util signals
it knows are wrong.

Isn't this case serialized, because the CPU selection for the next task will
happen after enqueuing the 1st one?

I'm not sure I understand the question, but if the enqueue is done remotely by
the target CPU, the waking CPU might call feec() again in the meantime.



The issue is problematic as it creates some instability. Once a
'parallel selection' has happened, the following scenarios can occur:
- the system goes overutilized, and EAS is disabled
- a frequency spike happens to handle the unexpected load.
Then the perf. domain becomes less energy efficient compared to other
perf. domains, and tasks are migrated out of this perf. domain

I made the following prototype to avoid 'parallel selections' (cf. the sketch
just below and the full patch at the end of this mail). The goal is to tag CPUs
that have a pending migration: a target CPU is tagged as 'eas_pending_enqueue'
at the end of feec(), and other CPUs should then not consider it a valid
candidate.
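
To make the intent clearer, here is the same idea reduced to a self-contained
user-space sketch with C11 atomics (claim_cpu()/release_cpu() are illustrative
names only, the actual prototype is in the patch at the end of this mail):

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	#define NR_CPUS 8

	/* One flag per CPU; static storage, so initialized to 0. */
	static atomic_int eas_pending_enqueue[NR_CPUS];

	/* feec() side: try to claim @cpu as a migration target. */
	static bool claim_cpu(int cpu)
	{
		int expected = 0;

		/* Fails if a concurrent selection already claimed @cpu. */
		return atomic_compare_exchange_strong(&eas_pending_enqueue[cpu],
						      &expected, 1);
	}

	/* enqueue_task_fair() side: util signals are updated, release @cpu. */
	static void release_cpu(int cpu)
	{
		atomic_store(&eas_pending_enqueue[cpu], 0);
	}

	int main(void)
	{
		int target = 3;

		if (claim_cpu(target))
			printf("CPU%d claimed for a pending migration\n", target);

		/* A concurrent feec() would now skip CPU3 and keep prev_cpu. */
		if (!claim_cpu(target))
			printf("CPU%d busy, falling back to prev_cpu\n", target);

		/* Done once the task is actually enqueued. */
		release_cpu(target);
		return 0;
	}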

The implementation is a bit raw, but it gives some good results. Using rt-app
workloads, while trying to avoid tasks waking up at the same time during
the whole test:
Workload1:
N tasks with a period of 16ms and a util of either 4 or 8 (cf. the 'util'
column of the tables below). Each task starts with a 4ms delay. Each
workload lasts 20s and is run over 5 iterations.

Workload2:
N tasks with a period of (8 + n)ms and a util of either 4 or 8. I.e. the first
task has a period of 8ms, the second task a period of 9ms, etc. Each workload
lasts 20s and is run over 5 iterations.

The following are presented:
- the measured energy consumed, according to the Pixel6 energy meters
- the estimated energy consumed: LISA uses the util signals along with
the CPU frequencies and the Energy Model to compute an estimate (cf. the
sketch right after this list)
- the amount of time spent in the overutilized state, as a percentage.
In the tables, the 'ratio' columns give the relative difference between
'with' and 'without' the patch, in % (negative is better).
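
For reference, the Energy Model estimate boils down to something like the
following (simplified from em_cpu_energy() in include/linux/energy_model.h,
quoted from memory, so take it as a sketch rather than the exact code):

	/*
	 * ps is the lowest perf. state of the perf. domain able to serve
	 * the max utilization among its CPUs, sum_util is the sum of the
	 * CPUs' utilizations, and ps->cost = ps->power * max_freq / ps->freq.
	 */
	energy = ps->cost * sum_util / scale_cpu;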

------

Workload1:

Measured energy:
+------+-------+--------------+--------------+------------+
| util | count | without | with | ratio |
+------+-------+--------------+--------------+------------+
| 4 | 8 | 3220.970324 | 3312.097508 | 2.829184 |
| 4 | 12 | 5942.486726 | 5016.106047 | -15.589108 |
| 4 | 16 | 10412.26692 | 10017.633658 | -3.79008 |
| 8 | 8 | 7524.271751 | 7479.451427 | -0.595677 |
| 8 | 12 | 14782.214144 | 14567.282266 | -1.45399 |
| 8 | 16 | 21452.863497 | 19561.143385 | -8.818031 |
+------+-------+--------------+--------------+------------+
Std:
+------+-------+-------------+-------------+
| util | count | without | with |
+------+-------+-------------+-------------+
| 4 | 8 | 165.563394 | 48.903514 |
| 4 | 12 | 518.609612 | 81.170952 |
| 4 | 16 | 329.729882 | 192.739245 |
| 8 | 8 | 105.144497 | 336.796522 |
| 8 | 12 | 384.615323 | 339.86986 |
| 8 | 16 | 1252.735561 | 2563.268952 |
+------+-------+-------------+-------------+

Estimated energy:
+------+-------+-----------+-----------+------------+
| util | count | without | with | ratio |
+------+-------+-----------+-----------+------------+
| 4 | 8 | 1.4372e10 | 1.2791e10 | -11.000273 |
| 4 | 12 | 3.1881e10 | 2.3743e10 | -25.526193 |
| 4 | 16 | 5.7663e10 | 5.4079e10 | -6.215679 |
| 8 | 8 | 2.5622e10 | 2.5337e10 | -1.109823 |
| 8 | 12 | 6.4332e10 | 6.9335e10 | 7.776814 | [1]
| 8 | 16 | 9.5285e10 | 8.2331e10 | -13.594508 |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+----------+-----------+
| util | count | without | with |
+------+-------+----------+-----------+
| 4 | 8 | 1.3896e9 | 5.4265e8 |
| 4 | 12 | 4.7511e9 | 5.1521e8 |
| 4 | 16 | 3.5486e9 | 1.2625e9 |
| 8 | 8 | 3.0033e8 | 2.3168e9 |
| 8 | 12 | 8.7739e9 | 3.0743e9 |
| 8 | 16 | 6.7982e9 | 2.2393e10 |
+------+-------+----------+-----------+

Overutilized ratio (in % of the 20s test):
+------+-------+-----------+-----------+------------+
| util | count | without | with | ratio |
+------+-------+-----------+-----------+------------+
| 4 | 8 | 0.187941 | 0.015834 | -91.575158 |
| 4 | 12 | 0.543073 | 0.045483 | -91.624815 |
| 4 | 16 | 8.510734 | 8.389077 | -1.429448 |
| 8 | 8 | 1.056678 | 0.876095 | -17.089643 |
| 8 | 12 | 36.457757 | 9.260862 | -74.598378 | [1]
| 8 | 16 | 72.327933 | 78.693558 | 8.801061 |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+-----------+-----------+
| util | count | without | with |
+------+-------+-----------+-----------+
| 4 | 8 | 0.232077 | 0.016531 |
| 4 | 12 | 0.338637 | 0.040252 |
| 4 | 16 | 0.729743 | 6.368214 |
| 8 | 8 | 1.702964 | 1.722589 |
| 8 | 12 | 34.436278 | 17.314564 |
| 8 | 16 | 14.540217 | 33.77831 |
+------+-------+-----------+-----------+

------

Workload2:

Measured energy:
+------+-------+--------------+--------------+-----------+
| util | count | without | with | ratio |
+------+-------+--------------+--------------+-----------+
| 4 | 8 | 3357.578785 | 3324.890715 | -0.973561 |
| 4 | 12 | 5024.573746 | 4903.394533 | -2.411731 |
| 4 | 16 | 10114.715431 | 9762.803821 | -3.479204 |
| 8 | 8 | 7485.230678 | 6961.782086 | -6.993086 |
| 8 | 12 | 13720.482516 | 13374.765825 | -2.519712 |
| 8 | 16 | 24846.806317 | 24444.012805 | -1.621108 |
+------+-------+--------------+--------------+-----------+
Std:
+------+-------+------------+------------+
| util | count | without | with |
+------+-------+------------+------------+
| 4 | 8 | 87.450628 | 76.955783 |
| 4 | 12 | 106.062839 | 116.882891 |
| 4 | 16 | 182.525881 | 172.819307 |
| 8 | 8 | 874.292359 | 162.790237 |
| 8 | 12 | 151.830636 | 339.286741 |
| 8 | 16 | 904.751446 | 154.419644 |
+------+-------+------------+------------+

Estimated energy:
+------+-------+-----------+-----------+------------+
| util | count | without | with | ratio |
+------+-------+-----------+-----------+------------+
| 4 | 8 | 1.4778e10 | 1.4805e10 | 0.184658 |
| 4 | 12 | 2.6105e10 | 2.5485e10 | -2.374486 |
| 4 | 16 | 5.8394e10 | 5.7177e10 | -2.083208 |
| 8 | 8 | 3.0275e10 | 2.5973e10 | -14.211178 |
| 8 | 12 | 7.0616e10 | 6.9085e10 | -2.168347 |
| 8 | 16 | 1.3133e11 | 1.2891e11 | -1.839725 |
+------+-------+-----------+-----------+------------+
Std:
+------+-------+----------+----------+
| util | count | without | with |
+------+-------+----------+----------+
| 4 | 8 | 3.5449e8 | 8.2454e8 |
| 4 | 12 | 9.4248e8 | 1.1364e9 |
| 4 | 16 | 8.3240e8 | 1.2084e9 |
| 8 | 8 | 9.0364e9 | 5.0381e8 |
| 8 | 12 | 9.9112e8 | 3.0836e9 |
| 8 | 16 | 4.9429e8 | 1.9533e9 |
+------+-------+----------+----------+

Overutilized ratio (in % of the 20s test):
+------+-------+-----------+----------+------------+
| util | count | without | with | ratio |
+------+-------+-----------+----------+------------+
| 4 | 8 | 0.154992 | 0.049429 | -68.108419 |
| 4 | 12 | 0.132593 | 0.061762 | -53.420202 |
| 4 | 16 | 6.798091 | 4.606102 | -32.244179 |
| 8 | 8 | 1.360703 | 0.174626 | -87.166465 |
| 8 | 12 | 0.519704 | 0.250469 | -51.805502 |
| 8 | 16 | 12.114269 | 8.969281 | -25.961019 |
+------+-------+-----------+----------+------------+
Std:
+------+-------+----------+----------+
| util | count | without | with |
+------+-------+----------+----------+
| 4 | 8 | 0.212919 | 0.036856 |
| 4 | 12 | 0.069696 | 0.060257 |
| 4 | 16 | 0.63995 | 0.542028 |
| 8 | 8 | 2.158079 | 0.211775 |
| 8 | 12 | 0.089159 | 0.187436 |
| 8 | 16 | 0.798565 | 1.669003 |
+------+-------+----------+----------+

------

Analysis:

- [1]
Without the patch, 2 tasks end up on one little CPU. According to the Energy
Model, this consumes less energy than using the medium/big CPUs, but EAS
should not be able to do such a task placement, as the little CPU becomes
overutilized. Indeed, without the patch the system is overutilized a lot
more than with the patch.

-
Looking at the overutilized ratio, being overutilized 0.5% of the time or
0.05% of the time might both seem negligible, but it means that EAS ended up
doing a bad task placement multiple, independent times.

-
The overutilized ratio should be checked along with the energy results, as
it shows how much EAS was actually involved in the task placement.

-
Overall, less energy is consumed with the patch. The amount of energy saved
varies with the workload.

------

On another note, I wanted to ask whether there will be a v2 of the present
patchset (sched/fair: Rework EAS to handle more cases),

yes, I have been side-tracked by other stuff since LPC and haven't
been able to finalize testing on v2, but it's ongoing

Ok thanks!



Regards,
Pierre

------


diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb343136ddd0..812d5bf88875 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1592,6 +1592,12 @@ struct task_struct {
 	struct user_event_mm *user_event_mm;
 #endif
 
+	/*
+	 * Keep track of the CPU feec() migrated this task to: the per-CPU
+	 * 'eas_pending_enqueue' flag of that CPU must be reset at enqueue.
+	 */
+	int eas_target_cpu;
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c157d4860a3b..34911eb059cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6945,6 +6945,8 @@ requeue_delayed_entity(struct sched_entity *se)
 	se->sched_delayed = 0;
 }
 
+DEFINE_PER_CPU(atomic_t, eas_pending_enqueue);
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -7064,6 +7066,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	check_update_overutilized_status(rq);
 
 enqueue_throttle:
+	if (p->eas_target_cpu != -1) {
+		atomic_set(&per_cpu(eas_pending_enqueue, p->eas_target_cpu), 0);
+		p->eas_target_cpu = -1;
+	}
+
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
@@ -8451,6 +8458,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 			continue;
 
+		/* Skip this CPU as its util signal will be invalid soon. */
+		if (atomic_read(&per_cpu(eas_pending_enqueue, cpu)) &&
+		    cpu != prev_cpu)
+			continue;
+
 		util = cpu_util(cpu, p, cpu, 0);
 		cpu_cap = capacity_of(cpu);
 
@@ -8560,6 +8572,17 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	    ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
 		target = best_energy_cpu;
 
+	/*
+	 * 'Lock' the target CPU if there is a migration. Prevent other feec()
+	 * calls from using the same target CPU until util signals are updated.
+	 */
+	if (prev_cpu != target) {
+		if (!atomic_cmpxchg_acquire(&per_cpu(eas_pending_enqueue, target), 0, 1))
+			p->eas_target_cpu = target;
+		else
+			target = prev_cpu;
+	}
+
 	return target;
 
 unlock: