Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode

From: K Prateek Nayak
Date: Thu Feb 20 2025 - 12:02:54 EST


Hello Valentin,

On 2/20/2025 9:10 PM, Valentin Schneider wrote:
On 20/02/25 12:32, Peter Zijlstra wrote:
On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:

The rationale there was that with a growing number of tasks on the
cfs_rq, the throttle path has to perform a lot of dequeues, and the
unthrottle at distribution has to enqueue all the dequeued threads back.

This is one way to keep all the tasks queued but allow pick to only
select among those that are preempted in kernel mode.
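To make the idea concrete, here is a rough userspace model of that pick-time filtering (plain C, not the actual kernel code; the task struct, flag, and helper names are all made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: a task remains queued across throttle, but while the
 * cfs_rq is throttled, pick only considers tasks that were preempted
 * in kernel mode. */
struct task {
	int  pid;
	bool in_kernel;	/* preempted while executing in kernel mode */
};

/* Return the first eligible task; when throttled, skip tasks that were
 * preempted in user mode (running them would only consume more quota). */
static const struct task *pick_next(const struct task *q, size_t n,
				    bool throttled)
{
	for (size_t i = 0; i < n; i++) {
		if (!throttled || q[i].in_kernel)
			return &q[i];
	}
	return NULL;	/* nothing pickable until unthrottle */
}
```

The point is that throttle never has to dequeue anything; only the pick path changes while the bandwidth constraint is active.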

Since per-task throttling needs to tag, dequeue, and re-enqueue each
task, I'm putting this out as an alternate approach that does not
increase the complexity of tg_tree walks, which Ben had noted on
Valentin's series [1]. Instead we retain the per cfs_rq throttling
at the cost of some stats tracking at enqueue and dequeue
boundaries.

If you have strong feelings against any specific part, or the entirety
of this approach, please do let me know, and I'll do my best to see if
a tweaked approach or an alternate implementation can scale well with
growing thread counts (or at least try to defend the bits in question if
they hold merit still).

Any and all feedback is appreciated :)

Pfff.. I hate it all :-)

So the dequeue approach puts the pain on the people actually using the
bandwidth crud, while this 'some extra accounting' crap has *everybody*
pay for this nonsense, right?

I'm not sure how bad this extra accounting is, but I do fear death by a
thousand cuts.

FWIW that was my main worry with the dual tree approach and why I gave up
on it in favor of the per-task dequeue faff. Having the overhead mainly
contained in throttle/unthrottle is a lot more attractive than adding
(arguably small) overhead to the enqueue/dequeue paths. There was also the
headache of figuring out what to do with the .*nr_running fields and what
is reflected to load balance, which isn't an issue with the per-task thing.

I believe that with the differentiation between nr_queued and
nr_runnable now, the counts would be simpler to correct (I might be wrong).

This approach retains the single rbtree, but yes, there is a cost
associated with maintaining these stats. The stats collection can be
deferred until a bandwidth constraint is first enforced, but thereafter
a small cost remains in every enqueue, dequeue, put_prev_entity, and
set_next_entity path.

Arguably, this should be no costlier than the current tracking of
h_nr_delayed + min_slice in the enqueue and dequeue paths, but I might
be wrong.
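For illustration, the extra bookkeeping amounts to something like the following (a self-contained sketch; the struct and field names are invented for the example and do not match the actual patches):

```c
#include <assert.h>
#include <stdbool.h>

/* Counters updated at the enqueue/dequeue boundaries, so that
 * throttle/unthrottle can consult them without walking the tree. */
struct cfs_rq_stats {
	unsigned int nr_queued;		/* all tasks on the rbtree */
	unsigned int nr_kernel;		/* tasks preempted in kernel mode */
};

static void account_enqueue(struct cfs_rq_stats *s, bool in_kernel)
{
	s->nr_queued++;
	if (in_kernel)
		s->nr_kernel++;
}

static void account_dequeue(struct cfs_rq_stats *s, bool in_kernel)
{
	s->nr_queued--;
	if (in_kernel)
		s->nr_kernel--;
}
```

A pair of increments behind a branch per enqueue/dequeue, which is why I compare it to the existing h_nr_delayed tracking, though everybody pays it once bandwidth control is first used.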


As pointed by Ben in [1], the issue with the per-task approach is the
scalability of the unthrottle. You have the rq lock held and you
potentially end up unthrottling a deep cgroup hierarchy, putting each
individual task back on its cfs_rq.

Agreed, which is why this alternate approach of retaining throttling and
unthrottling at the cfs_rq level was worth a try.


I can't find my notes on that in a hurry, but my idea with that for a next
version was to periodically release the rq lock as we go up the cgroup
hierarchy during unthrottle - the idea being that we can mess with part of
hierarchy, and as long as that part isn't connected to the rest (i.e. it's
not enqueued, like we currently do for CFS throttling), "it should be
safe".

That is pretty nifty! My only concern there would be the case where
part of the throttled hierarchy is still reachable on unthrottle while
another part has dequeued itself - some tasks might have to wait until
that part is queued again to be reachable during pick, and a bunch of
rescheds would follow with each batch of enqueues.
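If I understood the batched-unlock idea correctly, it could be modeled roughly like this (a toy userspace sketch; the batch size, counter, and the stand-in for dropping the rq lock are all illustrative, not from either series):

```c
#include <assert.h>

/* Toy model of batched unthrottle: re-enqueue tasks in bounded batches,
 * releasing the (stand-in) rq lock between batches so other CPUs can
 * make progress while the detached part of the hierarchy is worked on. */
#define BATCH	4

static int unthrottle_batched(int nr_tasks, int *lock_releases)
{
	int done = 0;

	*lock_releases = 0;
	while (done < nr_tasks) {
		/* rq lock held here: enqueue up to BATCH tasks */
		int n = nr_tasks - done < BATCH ? nr_tasks - done : BATCH;

		done += n;
		if (done < nr_tasks)
			(*lock_releases)++;	/* drop and retake rq lock */
	}
	return done;
}
```

My concern above is about what the pick path sees in the windows where the lock has been released but some batches are not yet enqueued.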


FYI I haven't given up on this, it's just that repeatedly context switching
between IPI deferral and this didn't really work for me so I'm sticking to
one 'till it gets somewhere.

Ack! This RFC was to get feedback from folks and to see if there are
any takers for cfs_rq level throttling and the reasons to move to a
per-task throttling. Safe to say I'm slowly getting some answers :)


[1]: https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@xxxxxxxxxx/


--
Thanks and Regards,
Prateek