[PATCH 0/11 v3] sched/fair: Fix statistics with delayed dequeue

From: Vincent Guittot
Date: Mon Dec 02 2024 - 15:02:43 EST


The delayed dequeue feature keeps a sleeping sched_entity enqueued until its
lag has elapsed. As a result, the entity also stays visible in the statistics
that are used to balance the system, in particular the field h_nr_running.

This series fixes those metrics by creating a new h_nr_runnable that tracks
only tasks that want to run. It renames h_nr_running into h_nr_queued.

h_nr_runnable is used in several places to make load balance decisions:
- PELT runnable_avg
- deciding if a group is overloaded or has spare capacity
- numa stats
- reduced capacity management
- load balance between groups

While fixing h_nr_running, some fields have been renamed to follow the
same pattern. We now have:
- cfs.h_nr_runnable : runnable tasks in the hierarchy
- cfs.h_nr_queued : enqueued tasks in the hierarchy, either runnable or
delayed dequeue
- cfs.h_nr_idle : enqueued sched idle tasks in the hierarchy
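
As an illustration of how the three counters relate, here is a minimal
sketch. It is a toy model and not the kernel implementation: the toy_*
types and helpers are hypothetical, and only the field names and their
meaning come from the list above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: the three hierarchical counters. */
struct toy_cfs_rq {
	unsigned int h_nr_queued;   /* enqueued tasks: runnable + delayed dequeue */
	unsigned int h_nr_runnable; /* enqueued tasks that want to run */
	unsigned int h_nr_idle;     /* enqueued tasks with sched_idle policy */
};

/* Hypothetical stand-in for the per-task state that matters here. */
struct toy_task {
	bool sched_idle;    /* task uses the sched_idle policy */
	bool sched_delayed; /* task sleeps but is still enqueued */
};

/* A task wakes up and is enqueued: it is both queued and runnable. */
static void toy_enqueue(struct toy_cfs_rq *cfs, struct toy_task *p)
{
	cfs->h_nr_queued++;
	cfs->h_nr_runnable++;
	if (p->sched_idle)
		cfs->h_nr_idle++;
}

/* A task sleeps before its lag has elapsed: it stays queued only. */
static void toy_delay_dequeue(struct toy_cfs_rq *cfs, struct toy_task *p)
{
	assert(!p->sched_delayed);
	p->sched_delayed = true;
	cfs->h_nr_runnable--;
}

/* The task really leaves the runqueue, delayed or not. */
static void toy_dequeue(struct toy_cfs_rq *cfs, struct toy_task *p)
{
	if (!p->sched_delayed)
		cfs->h_nr_runnable--;
	p->sched_delayed = false;
	cfs->h_nr_queued--;
	if (p->sched_idle)
		cfs->h_nr_idle--;
}
```

In this model, h_nr_queued - h_nr_runnable is always the number of
delayed dequeue tasks, which is why a separate delayed counter is not
needed.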

cfs.nr_running has been renamed to cfs.nr_queued because it includes the
delayed dequeued entities.

The unused cfs.idle_nr_running has been removed.

Load balance compares the number of runnable tasks when selecting the
busiest group or runqueue and tries to migrate a runnable task, not a
sleeping delayed dequeue one. Delayed dequeue tasks are considered only
when migrating load, as they continue to impact it.
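
The following sketch illustrates that behaviour under simplifying
assumptions. It is not the code from the series: the toy_* names and the
two-value migration type are hypothetical, and real load balancing
weighs many more criteria than a single counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code. All toy_* names are hypothetical. */
enum toy_migration_type {
	TOY_MIGRATE_LOAD, /* moving load: delayed tasks still carry load */
	TOY_MIGRATE_TASK, /* moving a task that wants to run */
};

struct toy_task {
	bool sched_delayed; /* sleeping, but still enqueued */
};

struct toy_rq {
	unsigned int h_nr_queued;   /* runnable + delayed dequeue tasks */
	unsigned int h_nr_runnable; /* tasks that want to run */
};

/*
 * Pick the runqueue with the most runnable tasks. Returns its index,
 * or -1 when no runqueue has a runnable task.
 */
static int toy_find_busiest(const struct toy_rq *rqs, size_t n)
{
	int busiest = -1;
	unsigned int max_runnable = 0;

	for (size_t i = 0; i < n; i++) {
		if (rqs[i].h_nr_runnable > max_runnable) {
			max_runnable = rqs[i].h_nr_runnable;
			busiest = (int)i;
		}
	}
	return busiest;
}

/* A delayed dequeue task is a candidate only when migrating load. */
static bool toy_can_migrate(const struct toy_task *p,
			    enum toy_migration_type type)
{
	if (p->sched_delayed && type != TOY_MIGRATE_LOAD)
		return false;
	return true;
}
```

With a runqueue holding 3 queued tasks of which 1 is runnable, and
another holding 2 queued tasks that are both runnable, comparing
h_nr_queued would select the first one while comparing h_nr_runnable
selects the second.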

Note that this series doesn't fix the problem of delayed dequeued tasks
that can't migrate at wakeup.

Some additional cleanups have been added:
- move variable declarations to the beginning of pick_next_entity()
and dequeue_entity()
- sched_can_stop_tick() should use cfs.h_nr_queued instead of
cfs.nr_queued (previously cfs.nr_running) to know how many tasks
are queued in the whole hierarchy instead of how many entities are
queued at root level
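
To make the sched_can_stop_tick() point concrete, here is a small
sketch. It models only the fair-class part of the decision, the toy_*
names are hypothetical, and the "at most one task" threshold is an
assumption made for illustration.

```c
#include <stdbool.h>

/* Toy model, not kernel code. All toy_* names are hypothetical. */
struct toy_cfs_rq {
	unsigned int nr_queued;   /* entities queued at this level only */
	unsigned int h_nr_queued; /* tasks queued in the whole hierarchy */
};

/* Counts tasks in the whole hierarchy: more than one needs the tick. */
static bool toy_fair_can_stop_tick(const struct toy_cfs_rq *root)
{
	return root->h_nr_queued <= 1;
}

/*
 * Counts entities at root level only: a single group entity hides
 * every task queued below it.
 */
static bool toy_fair_can_stop_tick_root_only(const struct toy_cfs_rq *root)
{
	return root->nr_queued <= 1;
}
```

With one task group at root level that contains two tasks, nr_queued is
1 and h_nr_queued is 2. The root-only check would let the tick stop
even though two tasks still have to share the CPU.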

Changes since v2:
- Fix h_nr_runnable after removing h_nr_delayed (reported by Mike and Prateek)
- Move "sched/fair: Fix sched_can_stop_tick() for fair tasks" to the
beginning of the series so it can be easily backported (asked by Prateek)
- Split "sched/fair: Add new cfs_rq.h_nr_runnable" in 2 patches. One
for the creation of h_nr_runnable and one for its use (asked by Peter)
- Fix more variable declarations (reported by Prateek)


Changes since v1:
- Reorder the patches
- Rename fields into:
  - h_nr_queued for all queued tasks, both runnable and delayed dequeue
  - h_nr_runnable for all runnable tasks
  - h_nr_idle for all tasks with sched_idle policy
- Clean up how h_nr_runnable is updated in enqueue_task_fair() and
dequeue_entities()

Peter Zijlstra (1):
sched/eevdf: More PELT vs DELAYED_DEQUEUE

Vincent Guittot (10):
sched/fair: Fix sched_can_stop_tick() for fair tasks
sched/fair: Rename h_nr_running into h_nr_queued
sched/fair: Add new cfs_rq.h_nr_runnable
sched/fair: Use the new cfs_rq.h_nr_runnable
sched/fair: Removed unsued cfs_rq.h_nr_delayed
sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
sched/fair: Remove unused cfs_rq.idle_nr_running
sched/fair: Rename cfs_rq.nr_running into nr_queued
sched/fair: Do not try to migrate delayed dequeue task
sched/fair: Fix variable declaration position

kernel/sched/core.c | 4 +-
kernel/sched/debug.c | 14 ++-
kernel/sched/fair.c | 240 ++++++++++++++++++++++++-------------------
kernel/sched/pelt.c | 4 +-
kernel/sched/sched.h | 12 +--
5 files changed, 153 insertions(+), 121 deletions(-)

--
2.43.0