[PATCH v2 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy
From: K Prateek Nayak
Date: Tue Jun 02 2026 - 01:00:51 EST
v2 addresses comments from Ben, Aaron, and Peter, with a couple of small
optimizations on top. Individual patches contain the changelog.
Introduction
============
The current hierarchy is always throttled in __schedule() during the
pick, when update_curr() detects a cfs_rq running out of bandwidth and
issues a resched.
This was necessary prior to per-task throttling, when the entire
throttled hierarchy was dequeued at the point of first throttle during
the pick. With per-task throttling, tasks continue to run as usual until
they exit to userspace, dequeuing themselves one by one until the
hierarchy is deemed fully throttled and PELT is frozen.
throttle_cfs_rq() is now simply a propagator of throttle indicators and
nothing more.
Implementation
==============
Unify the throttling for current hierarchy under
account_cfs_rq_runtime() which is responsible for the time accounting.
If the bandwidth runs out, account_cfs_rq_runtime() will request
sched_cfs_bandwidth_slice() worth of runtime and mark the hierarchy as
throttled if it fails to grab bandwidth.
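To make the accounting flow above concrete, here is a minimal userspace sketch. It is not the kernel code: struct toy_pool, struct toy_cfs_rq, and all toy_*() helpers are hypothetical stand-ins, and the real path also deals with locking, period timers, and hierarchy walks that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the group-wide pool and a per-CPU cfs_rq; not kernel types. */
struct toy_pool {
	int64_t runtime;	/* quota left in the current period, in ns */
	int64_t slice;		/* amount handed out per refill request, in ns */
};

struct toy_cfs_rq {
	int64_t runtime_remaining;	/* local budget, may go negative */
	bool throttled;
};

/* Hand out up to one slice from the pool; returns the amount granted. */
static int64_t toy_grab_slice(struct toy_pool *pool)
{
	int64_t grant = pool->runtime < pool->slice ? pool->runtime : pool->slice;

	pool->runtime -= grant;
	return grant;
}

/*
 * Charge delta_exec to the local budget. On exhaustion, ask the pool for
 * a slice and mark the queue throttled if the budget is still not positive.
 */
static void toy_account_runtime(struct toy_pool *pool, struct toy_cfs_rq *rq,
				int64_t delta_exec)
{
	rq->runtime_remaining -= delta_exec;
	if (rq->runtime_remaining > 0)
		return;

	rq->runtime_remaining += toy_grab_slice(pool);
	if (rq->runtime_remaining <= 0)
		rq->throttled = true;
}

/* Walk through one successful refill, then a throttle once the pool is dry. */
static void toy_account_demo(void)
{
	struct toy_pool pool = { .runtime = 5000, .slice = 5000 };
	struct toy_cfs_rq rq = { .runtime_remaining = 1000, .throttled = false };

	toy_account_runtime(&pool, &rq, 3000);	/* 1000 - 3000 + 5000 refill */
	assert(rq.runtime_remaining == 3000 && !rq.throttled);

	toy_account_runtime(&pool, &rq, 4000);	/* 3000 - 4000, pool is empty */
	assert(rq.runtime_remaining == -1000 && rq.throttled);
}
```

The point of the sketch is only that the throttle decision sits in the same place as the time accounting, so no separate pass at pick time is needed to notice the exhausted budget.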
throttle_cfs_rq() will do a task_throttle_setup_work() if it finds the
current task to be on a throttled hierarchy, and the task will naturally
dequeue itself when it exits to userspace without needing an explicit
resched.
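The deferred dequeue described above can be modelled in a few lines of userspace C. This is a sketch under assumptions, not the kernel implementation: the toy_*() names and both structs are hypothetical, and toy_pick() assumes that a task picked later on an already throttled hierarchy gets the same work armed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
	bool queued;		/* still on the runqueue */
	bool throttle_work;	/* deferred dequeue is pending */
};

struct toy_hier {
	bool throttled;		/* throttle indicator for the hierarchy */
	size_t nr_queued;	/* tasks that have not dequeued themselves yet */
};

/* Mark the hierarchy and arm the deferred work on the current task only. */
static void toy_throttle(struct toy_hier *h, struct toy_task *curr)
{
	h->throttled = true;
	if (curr && curr->queued)
		curr->throttle_work = true;
}

/* Assumption: a task picked on a throttled hierarchy is armed the same way. */
static void toy_pick(const struct toy_hier *h, struct toy_task *t)
{
	if (h->throttled && t->queued)
		t->throttle_work = true;
}

/* On exit to userspace, pending work makes the task dequeue itself. */
static void toy_exit_to_user(struct toy_hier *h, struct toy_task *t)
{
	if (!t->throttle_work)
		return;

	t->throttle_work = false;
	t->queued = false;
	h->nr_queued--;
}

/* Fully throttled once the indicator is set and every task has left. */
static bool toy_fully_throttled(const struct toy_hier *h)
{
	return h->throttled && h->nr_queued == 0;
}

static void toy_throttle_demo(void)
{
	struct toy_task a = { .queued = true }, b = { .queued = true };
	struct toy_hier h = { .throttled = false, .nr_queued = 2 };

	toy_throttle(&h, &a);		/* budget ran out while a was current */
	assert(a.queued);		/* a keeps running until it leaves the kernel */

	toy_exit_to_user(&h, &a);	/* a dequeues itself, b is still queued */
	assert(!a.queued && !toy_fully_throttled(&h));

	toy_pick(&h, &b);
	toy_exit_to_user(&h, &b);
	assert(toy_fully_throttled(&h));
}
```

Nothing in the model forces a reschedule at throttle time; the task simply carries a pending flag to its next return to userspace.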
The first four patches are cleanups and preparation for the final patch,
provided by Peter in [1], which switches over to using
account_cfs_rq_runtime() for throttling.
Benchmarking
============
The following are the results of hackbench running 3 levels deep with
the setup from the "Testing" section of [2], compared to tip:sched/core:
  kernel         :       tip    tip + series
  Min            :    207.33          202.20
  Max            :    210.20          222.47
  Median         :    207.83          218.33
  AMean          :    208.29          215.36
  GMean          :    208.29          215.25
  HMean          :    208.29          215.13
  AMean Stddev   :      1.02            7.37
  AMean CoefVar  :  0.49 pct        3.42 pct
All numbers are in seconds.
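For anyone cross-checking the table, the CoefVar row is 100 * Stddev / AMean, and the same arithmetic gives the relative change in AMean between the two kernels. A small sketch follows; the helper names are hypothetical and the inputs are copied from the rows above.

```c
#include <assert.h>

/* Coefficient of variation in percent: 100 * stddev / mean. */
static double coef_var_pct(double mean, double stddev)
{
	return 100.0 * stddev / mean;
}

/* Relative change of "after" against "before", in percent. */
static double rel_change_pct(double before, double after)
{
	return 100.0 * (after - before) / before;
}

/* True when value is within half a unit of the second decimal of expected. */
static int matches_2dp(double value, double expected)
{
	double diff = value - expected;

	return diff < 0.005 && diff > -0.005;
}

static void stats_demo(void)
{
	double delta;

	/* CoefVar row recomputed from the AMean and Stddev rows. */
	assert(matches_2dp(coef_var_pct(208.29, 1.02), 0.49));
	assert(matches_2dp(coef_var_pct(215.36, 7.37), 3.42));

	/* AMean with the series lands between 3 and 4 pct above tip. */
	delta = rel_change_pct(208.29, 215.36);
	assert(delta > 3.0 && delta < 4.0);
}
```

Since the times are in seconds, a positive relative change means a longer run.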
There is slight boot-to-boot variation for this benchmark, but the
utilization numbers in top are more or less similar between the two.
As always, additional testing and feedback are appreciated :-)
The patches apply on top of queue:sched/core at commit ce348f2b1998
("sched/fair: Allocate cfs_tg_state with percpu allocator"). All testing
was done on a dual-socket 4th Generation EPYC system (2 x 128C/256T).
CONFIG_CFS_BANDWIDTH=n was only build tested.
Changelog
=========
v1..v2:
  o Addressed comments from Ben in Patch 1 to keep the call to
    distribute_cfs_runtime() in do_sched_cfs_slack_timer() and used
    scoped_guard().
  o Instead of adding back "else { throttled = true; }" in
    distribute_cfs_runtime(), invert the condition to
    "cfs_rq->runtime_remaining <= 0" and break early after setting
    "throttled = true". (Ben)
  o Added missing update_rq_clock() in tg_set_cfs_bandwidth(). (Aaron,
    Peter, Intel Test Robot)
  o Added an update_curr() in distribute_cfs_runtime() which was found to
    be beneficial to prevent calling unthrottle_cfs_rq() unnecessarily.
  o Collected tags from Ben and Aaron (Thanks a ton!).
  o Rebased patches on top of:
      git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/core
    at commit ce348f2b1998 ("sched/fair: Allocate cfs_tg_state with percpu
    allocator").
References
==========
[1] https://lore.kernel.org/lkml/20260512110932.GB1889694@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/lkml/20250220093257.9380-1-kprateek.nayak@xxxxxxx/
[3] https://lore.kernel.org/lkml/20260522141623.600235-4-zli94@xxxxxxxx/
---
K Prateek Nayak (4):
  sched/fair: Convert cfs bandwidth throttling to use guards
  sched/fair: Use throttled_csd_list for local unthrottle
  sched/fair: Call update_curr() before unthrottling the hierarchy
  sched/fair: Move the throttled tasks to a local list in
    tg_unthrottle_up()

Peter Zijlstra (1):
  sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()

 kernel/sched/core.c |   5 +-
 kernel/sched/fair.c | 360 +++++++++++++++++++++++---------------------
 2 files changed, 193 insertions(+), 172 deletions(-)

base-commit: ce348f2b1998b4c90053a3be407c32102b132800
--
2.34.1