[RFC PATCH 0/5]

From: K Prateek Nayak
Date: Wed Apr 09 2025 - 07:17:16 EST


Hello everyone,

There was some interest at OSPM'25 in exploring the use of the push task
mechanism for idle and newidle balance. This series implements one such
idea. The main reason for the RFC is to understand whether this is the
implementation people were in favor of before I try to optimize it
across all the workloads in my test setup.

Note: The current performance of the prototype is rough. I haven't
optimized it yet since I would love some feedback first on the approach.


Current approach
================

The push task framework for the fair class has been cherry-picked from
Vincent's series [1] and has been implemented for the !EAS case.
This series implements the idea from Valentin [2] where, in the presence
of pushable tasks, the CPU sets itself on a per-LLC "overloaded_mask".
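In case it helps frame the discussion, here is a rough sketch of what
patches 1 and 2 do. The struct layout, the allocation details, and the
update_overloaded_mask() helper name are illustrative only, not
necessarily what the patches actually use:

struct sched_domain_shared {
	atomic_t	ref;
	atomic_t	nr_busy_cpus;
	int		has_idle_cores;
	int		nr_idle_scan;
	cpumask_var_t	overloaded_mask;	/* new: CPUs with pushable fair tasks */
};

/* Called when the first pushable task appears / the last one goes away. */
static inline void update_overloaded_mask(struct rq *rq, bool overloaded)
{
	struct sched_domain_shared *sds;
	int cpu = cpu_of(rq);

	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
	if (!sds)
		return;

	if (overloaded)
		cpumask_set_cpu(cpu, sds->overloaded_mask);
	else
		cpumask_clear_cpu(cpu, sds->overloaded_mask);
}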

The inter-NUMA newidle balance has been modified to traverse the CPUs
set on the overloaded mask of the local LLC first, and then the CPUs set
on the overloaded masks of the other LLCs in the same NUMA node, with
the goal of pulling a single task towards the newly idle CPU rather than
performing a full-fledged load balance.
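A simplified sketch of the traversal over one LLC's overloaded mask
follows. pull_one_pushable_task() is a stand-in for whatever helper
actually detaches a single task from the remote pushable list and
attaches it locally; the real path also walks the overloaded masks of
the other LLCs in the node if the local LLC has nothing to offer:

static int pull_from_overloaded_mask(struct rq *this_rq,
				     struct sched_domain_shared *sds)
{
	int this_cpu = cpu_of(this_rq);
	int cpu;

	for_each_cpu_wrap(cpu, sds->overloaded_mask, this_cpu + 1) {
		if (cpu == this_cpu)
			continue;

		/* Hypothetical helper: pull exactly one pushable task. */
		if (pull_one_pushable_task(this_rq, cpu_rq(cpu)))
			return 1;
	}

	return 0;
}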

This implements some of the ideas from David Vernet's SHARED_RUNQ
prototype [3], except that, instead of a single SHARED_RUNQ per-LLC /
per-shard, the overloaded mask serves as an indicator of the per-CPU
rq(s) containing pushable tasks that can be migrated to the CPU going
idle. This avoids a per-SHARED_RUNQ lock at the expense of maintaining
the overloaded cpumask.

The push callback itself has been modified to try to push the tasks on
the pushable task list to one of the CPUs on the "nohz.idle_cpus_mask",
taking load off the idle balance path.
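Roughly, the push side then looks like the following.
migrate_task_to_idle_cpu() is a placeholder for the actual migration
path, and the idle_cpu() recheck under the appropriate locks is omitted:

static int push_fair_task(struct rq *rq, struct task_struct *p)
{
	int cpu;

	for_each_cpu(cpu, nohz.idle_cpus_mask) {
		/* Respect the task's affinity. */
		if (!cpumask_test_cpu(cpu, p->cpus_ptr))
			continue;

		/* Placeholder for the actual migration of @p to @cpu. */
		return migrate_task_to_idle_cpu(rq, p, cpu);
	}

	return 0;
}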


Clarification required
======================

I believe using the per-CPU pushable task list as a proxy for a single
SHARED_RUNQ was the idea Peter was implying during the discussion. Is
this correct or did I completely misunderstand it? P.S. SHARED_RUNQ
could also be modelled as a large per-LLC push list.

An alternate implementation is to allow CPUs to go idle as quickly as
possible and then rely completely on the push mechanism and the
"idle_cpu_mask" to push tasks to an idle CPU; however, this puts the
burden of moving tasks on a busy, overloaded CPU, which may not be ideal.

Since folks mentioned using "push mechanism" for newidle balance, was
the above idea the one they had in mind?

There seems to be some clear advantage to doing a complete balance in
the newidle path. Since the schedstats are not wired up yet for the new
approach, I'm not completely sure where the advantages and disadvantages
currently lie.

If the current approach is right, I'll dig deeper and try to address all
the shortcomings of this prototype.

Systems with a unified LLC will likely run into bottlenecks maintaining
a large per-LLC mask that can see multiple concurrent updates. I plan to
implement an "sd_shard" that shards the large LLC, making the cpumask
maintenance less heavy on these systems.


References
==========

[1] https://lore.kernel.org/lkml/20250302210539.1563190-6-vincent.guittot@xxxxxxxxxx/
[2] https://lore.kernel.org/lkml/xhsmh1putoxbz.mognet@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[3] https://lore.kernel.org/lkml/20231212003141.216236-1-void@xxxxxxxxxxxxx/

--
K Prateek Nayak (4):
  sched/fair: Introduce overloaded_mask in sched_domain_shared
  sched/fair: Update overloaded mask in presence of pushable task
  sched/fair: Rework inter-NUMA newidle balancing
  sched/fair: Proactive idle balance using push mechanism

Vincent Guittot (1):
  sched/fair: Add push task framework

 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 297 +++++++++++++++++++++++++++++++--
 kernel/sched/sched.h           |   2 +
 kernel/sched/topology.c        |  25 ++-
 4 files changed, 306 insertions(+), 19 deletions(-)


base-commit: 6432e163ba1b7d80b5876792ce53e511f041ab91
--
2.34.1