[PATCH v10 0/7] feec() energy margin removal

From: Vincent Donnefort
Date: Tue Jun 07 2022 - 08:36:47 EST


Hi,

Here's a new version of the patch-set to get rid of the energy margin in
feec(). Many thanks to all for the insightful comments I got.

find_energy_efficient_cpu() (feec()) will migrate a task to save energy only
if it saves at least 6% of the total energy consumed by the system. This
conservative approach is a problem on a system where a lot of small tasks
create a huge load overall: since each of them only accounts for a small
fraction of the total energy, very few will be allowed to migrate to a
smaller CPU, wasting a lot of energy. Instead of trying to determine yet
another margin, let's try to remove it.
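To make the cost of that margin concrete, here is a simplified sketch of the two decision rules. It assumes prev_delta and best_delta are the estimated energy increases of placing the task on its previous CPU and on the best candidate CPU, and base_energy is the energy of the rest of the system, all in arbitrary units. The helper names are hypothetical and this is not the exact fair.c code. A right shift by 4 divides by 16, i.e. a margin of about 6.25%.

```c
#include <stdbool.h>

/*
 * Hypothetical helper modeling the old rule: only leave prev_cpu if the
 * saving exceeds 1/16 (~6.25%) of the total energy with the task on prev_cpu.
 */
static bool margin_allows_migration(unsigned long prev_delta,
                                    unsigned long best_delta,
                                    unsigned long base_energy)
{
        if (best_delta >= prev_delta)
                return false;   /* no saving at all */
        return (prev_delta - best_delta) > ((prev_delta + base_energy) >> 4);
}

/* Hypothetical margin-free rule: migrate whenever the estimate is lower. */
static bool no_margin_allows_migration(unsigned long prev_delta,
                                       unsigned long best_delta)
{
        return best_delta < prev_delta;
}
```

For example, with prev_delta = 100, best_delta = 90 and base_energy = 1000, the margin is (100 + 1000) >> 4 = 68, so a saving of 10 units is rejected, whereas the margin-free rule migrates the task. With many small tasks, each individual saving stays below such a threshold even though their sum is significant.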

The first patches of this series are fixes and improvements that stabilize
task_util and ensure a fair energy comparison across all CPUs of the
topology. Only once those are fixed can we completely remove the margin and
let feec() aggressively place tasks to save energy.
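Roughly speaking, one of those fixes ("Decay task PELT values during wakeup migration" in the shortlog) estimates how long the source CPU has been idle since its last PELT update and decays the waking task's signals by that amount, so this lag is not left unaccounted when the task moves. The decay itself follows PELT's geometric series, in which a signal halves every 32 periods of 1024 us. The sketch below assumes whole periods and uses floating point for the partial half-life, whereas the kernel uses precomputed fixed-point tables; decay_util() is a hypothetical name.

```c
#include <stdint.h>

/* PELT half-life: a signal halves every 32 periods of 1024 us. */
#define PELT_HALFLIFE_PERIODS 32

/*
 * Hypothetical helper: decay util by n elapsed PELT periods, i.e.
 * util * y^n with y^32 = 0.5. Full half-lives are applied as right
 * shifts; the remaining periods multiply by a floating-point y.
 */
static uint64_t decay_util(uint64_t util, unsigned int n)
{
        const double y = 0.978572062087700;     /* 2^(-1/32) */
        double v;
        unsigned int i;

        /* After 64 half-lives any realistic value has decayed to zero. */
        if (n >= PELT_HALFLIFE_PERIODS * 64)
                return 0;

        util >>= n / PELT_HALFLIFE_PERIODS;     /* whole half-lives */
        v = (double)util;
        for (i = 0; i < n % PELT_HALFLIFE_PERIODS; i++)
                v *= y;                         /* partial half-life */
        return (uint64_t)v;
}
```

For instance, a util_avg of 1024 becomes 512 after 32 idle periods and about 724 after 16.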

This has been validated in two different ways:

First, using LISA's eas_behaviour test suite. It is composed of a set of
scenarios and verifies whether the task placement is optimal. No failures
have been observed, and the series also improved some tests such as
Ramp-Down (as the placement is now more energy-oriented) and *ThreeSmall
(as tasks no longer bounce between clusters).

* Hikey960: 100% PASSED
* DB-845C: 100% PASSED
* RB5: 100% PASSED

Second, using an Android benchmark: PCMark2 on a Pixel4, with a lot of
backports to have a scheduler as close as possible to mainline.

+------------+------------------+------------------+
| Test       | Perf             | Energy [1]       |
+------------+------------------+------------------+
| Web2       | -0.3%  pval 0.03 | -1.8%  pval 0.00 |
| Video2     | -0.3%  pval 0.13 | -5.6%  pval 0.00 |
| Photo2 [2] | -3.8%  pval 0.00 | -1%    pval 0.00 |
| Writing2   | 0%     pval 0.13 | -1%    pval 0.00 |
| Data2      | 0%     pval 0.8  | -0.43% pval 0.00 |
+------------+------------------+------------------+

Removing the margin lets the kernel make the best use of the Energy Model:
tasks are more likely to be placed where they fit, which saves a
substantial amount of energy while having a limited impact on performance.

[1] This is an energy estimation based on the CPU activity and the Energy
Model for this device. "All models are wrong but some are useful"; yes,
this is an imperfect estimation that doesn't take into account some idle
states and shared power rails. Nonetheless, it is based on the information
the kernel has at runtime, and it shows that the scheduler can make better
decisions based solely on that data.
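As background on Energy Model estimations in general (not necessarily the exact method used for [1]): the kernel's EM roughly estimates a performance domain's energy by picking the lowest performance state whose capacity covers the domain's highest CPU utilization, then multiplying that state's cost (power * max capacity / capacity) by the domain's summed utilization divided by the maximum CPU capacity. The sketch below assumes a hypothetical OPP table sorted by increasing capacity and omits details such as the frequency headroom applied by the kernel; the names are illustrative only.

```c
#include <stddef.h>

/* One hypothetical performance state (OPP) of a performance domain. */
struct perf_state {
        unsigned long capacity; /* compute capacity at this OPP */
        unsigned long cost;     /* power * max_capacity / capacity */
};

/*
 * Estimate a perf domain's energy: select the lowest OPP whose capacity
 * covers max_util (falling back to the highest OPP), then scale its cost
 * by the domain's summed utilization. ps[] must be sorted by increasing
 * capacity, with ps[nr - 1] being the maximum capacity.
 */
static unsigned long pd_energy(const struct perf_state *ps, size_t nr,
                               unsigned long max_util, unsigned long sum_util)
{
        const unsigned long scale_cpu = ps[nr - 1].capacity;
        size_t i;

        for (i = 0; i < nr - 1; i++)
                if (ps[i].capacity >= max_util)
                        break;
        return ps[i].cost * sum_util / scale_cpu;
}
```

Since cost * sum_util / max_capacity equals power * (sum_util / capacity), the result is the selected state's power weighted by how busy the domain is at that state, which is why placing a task where it fits at a lower OPP can save energy.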

[2] This is the only performance impact observed. Debugging this test
showed no issue with task placement: the better score without this series
was solely due to some critical threads being held on better-performing
CPUs. If a thread needs a higher-capacity CPU, that placement must result
from user input (e.g. uclamp min) instead of feec() artificially holding
the thread on less energy-efficient CPUs. Note also that the experiment
didn't use the Android-only latency_sensitive feature, which would hide
this problem on a real-life device.
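As an illustration of such user input, here is a minimal sketch of requesting a per-task uclamp min from userspace with the sched_setattr(2) system call, invoked through syscall(2) since a libc wrapper may not be available. The struct layout and flag values are copied from the Linux UAPI headers (include/uapi/linux/sched/types.h and include/uapi/linux/sched.h) rather than included; the struct, macro and helper names are hypothetical. It assumes a kernel built with CONFIG_UCLAMP_TASK.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Mirrors the UAPI struct sched_attr (SCHED_ATTR_SIZE_VER1, 56 bytes). */
struct sched_attr_v1 {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
        uint32_t sched_util_min;
        uint32_t sched_util_max;
};

/* Values of SCHED_FLAG_KEEP_POLICY/KEEP_PARAMS/UTIL_CLAMP_MIN. */
#define SKETCH_FLAG_KEEP_POLICY    0x08
#define SKETCH_FLAG_KEEP_PARAMS    0x10
#define SKETCH_FLAG_UTIL_CLAMP_MIN 0x20

/* Fill attr so that only the uclamp min value (0..1024) is changed. */
static void fill_uclamp_min(struct sched_attr_v1 *attr, uint32_t util_min)
{
        memset(attr, 0, sizeof(*attr));
        attr->size = sizeof(*attr);
        attr->sched_flags = SKETCH_FLAG_KEEP_POLICY | SKETCH_FLAG_KEEP_PARAMS |
                            SKETCH_FLAG_UTIL_CLAMP_MIN;
        attr->sched_util_min = util_min;
}

/* Request a minimum utilization clamp for pid (0 = calling thread). */
static int set_uclamp_min(pid_t pid, uint32_t util_min)
{
        struct sched_attr_v1 attr;

        fill_uclamp_min(&attr, util_min);
        /* Returns 0 on success, -1 with errno set on failure. */
        return (int)syscall(SYS_sched_setattr, pid, &attr, 0);
}
```

For example, set_uclamp_min(0, 512) would ask the scheduler to treat the calling thread as needing at least half of the maximum capacity, steering it toward CPUs that can provide it without relying on a placement margin.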

v9 -> v10:
- Cosmetic changes for comments and commit messages. (Dietmar/Vincent G.)
- Renaming timestamp variables. (Dietmar)
- Fix for empty mask in feec(). (Dietmar)
- Collect Reviewed-by tags.

v8 -> v9:
- PELT migration decay: Fix barriers to prevent overestimation. (Vincent
G.)
- PELT migration decay: Fix CONFIG_GROUP_SCHED=n build.
- Various readability improvements. (Dietmar)
- Collect Reviewed-by tags.

v7 -> v8:
- PELT migration decay: Refine estimation computation. (Vincent G.)
- PELT migration decay: Do not apply estimation if load_avg is decayed
(Tao)
- PELT migration decay: throttled_pelt_idle update ordering for the
update_blocked_load case. (Vincent G.)

v6 -> v7:
- PELT migration decay: Add missing clock_pelt_idle updates.
- PELT migration decay: Fix PELT scaling delta for CONFIG_CFS_BANDWIDTH.

v4 -> v5:
- PELT migration decay: timestamp only at idle time (Vincent G.)
- PELT migration decay: split timestamp values (enter_idle /
clock_pelt_idle) (Vincent G.)

v3 -> v4:
- Minor cosmetic changes (Dietmar)

v2 -> v3:
- feec(): introduce energy_env struct (Dietmar)
- PELT migration decay: Only apply when src CPU is idle (Vincent G.)
- PELT migration decay: Do not apply when cfs_rq is throttled
- PELT migration decay: Snapshot the lag at cfs_rq's level

v1 -> v2:
- Fix PELT migration last_update_time (previously root cfs_rq's).
- Add Dietmar's patches to refactor feec()'s CPU loop.
- feec(): rename busy time functions get_{pd,tsk}_busy_time()
- feec(): pd_cap computation in the first for_each_cpu loop.
- feec(): create get_pd_max_util() function (previously within
compute_energy())
- feec(): rename base_energy_pd to base_energy.

Dietmar Eggemann (3):
sched, drivers: Remove max param from
effective_cpu_util()/sched_cpu_util()
sched/fair: Rename select_idle_mask to select_rq_mask
sched/fair: Use the same cpumask per-PD throughout
find_energy_efficient_cpu()

Vincent Donnefort (4):
sched/fair: Provide u64 read for 32-bits arch helper
sched/fair: Decay task PELT values during wakeup migration
sched/fair: Remove task_util from effective utilization in feec()
sched/fair: Remove the energy margin in feec()

 drivers/powercap/dtpm_cpu.c       |  33 +--
 drivers/thermal/cpufreq_cooling.c |   6 +-
 include/linux/sched.h             |   2 +-
 kernel/sched/core.c               |  15 +-
 kernel/sched/cpufreq_schedutil.c  |   5 +-
 kernel/sched/fair.c               | 470 +++++++++++++++++++-----------
 kernel/sched/pelt.h               |  40 ++-
 kernel/sched/sched.h              |  53 +++-
 8 files changed, 400 insertions(+), 224 deletions(-)

--
2.36.1.255.ge46751e96f-goog