Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode

From: K Prateek Nayak
Date: Thu Feb 20 2025 - 22:38:18 EST


Hello Josh,

Thank you for sharing the background!

On 2/21/2025 7:34 AM, Josh Don wrote:
> On Thu, Feb 20, 2025 at 4:04 AM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>>
>> Hello Peter,
>>
>> On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
>>> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>>>> Any and all feedback is appreciated :)
>>>
>>> Pfff.. I hate it all :-)
>>>
>>> So the dequeue approach puts the pain on the people actually using the
>>> bandwidth crud, while this 'some extra accounting' crap has *everybody*
>>> pay for this nonsense, right?

> Doing the context tracking could also provide benefit beyond CFS
> bandwidth. As an example, we often see a pattern where a thread
> acquires one mutex, then sleeps trying to take a second mutex. When
> the thread is eventually woken because the second mutex is now
> available, it then needs to wait to get back on cpu, which can take
> an arbitrary amount of time depending on where it landed in the tree,
> its weight, etc. Other threads trying to acquire that first mutex now
> experience priority inversion, as they must wait for the original
> thread to get back on cpu and release the mutex. Re-using the same
> context tracking, we could prioritize execution of threads in kernel
> critical sections, even if they aren't the fair next choice.
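
If I follow correctly, the pattern you describe boils down to the
sketch below (a userspace analogue with made-up names, not your actual
code):

/*
 * T1 holds m1 and then blocks on m2. Once m2 is released, T1 still
 * has to wait for cpu time before it can release m1, so every thread
 * blocked on m1 inherits T1's wakeup latency: the inversion.
 *
 * Build with: cc -pthread inversion.c
 */
#include <pthread.h>

static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

static void *t1_fn(void *arg)
{
	pthread_mutex_lock(&m1);
	/* Sleeps until m2 is free; after wakeup, T1 may still sit in
	 * the runqueue for an arbitrary time before running again. */
	pthread_mutex_lock(&m2);
	pthread_mutex_unlock(&m2);
	pthread_mutex_unlock(&m1);	/* only now can m1 waiters progress */
	return arg;
}

static void *t2_fn(void *arg)
{
	pthread_mutex_lock(&m1);	/* waits on T1's off-cpu time too */
	pthread_mutex_unlock(&m1);
	return arg;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, t1_fn, NULL);
	pthread_create(&t2, NULL, t2_fn, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}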

Just out of curiosity, have you tried running with proxy-execution [1][2]
on your deployments to mitigate priority inversion in mutexes? I've
tested it with smaller-scale benchmarks and haven't seen much overhead
except in the case of a few microbenchmarks, but I'm not sure if you've
run into any issues at your scale.

[1] https://lore.kernel.org/lkml/20241125195204.2374458-1-jstultz@xxxxxxxxxx/
[2] https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v14-6.13-rc1/


> If that isn't convincing enough, we could certainly throw another
> kconfig or boot param for this behavior :)

>> Is the expectation that these deployments have to be managed more
>> smartly if we move to a per-task throttling model? Else it is just a
>> hard lockup by a thousand tasks.

> +1, I don't see the per-task throttling being able to scale here.

>> If Ben or Josh can comment on any scalability issues they might have
>> seen on their deployments, and any lessons they have drawn from them
>> since LPC'24, it would be great. Any stats on the number of tasks that
>> get throttled in one go would also be helpful.

> Maybe just to emphasize that we continue to see the same type of
> slowness; throttle/unthrottle when traversing a large cgroup
> sub-hierarchy is still an issue for us, and we're working on sending
> a patch that ideally breaks this up to do the updates more lazily, as
> described at LPC.

Is it possible to share an example hierarchy from one of your
deployments? Your presentation from LPC'24 [3] says "O(1000) cgroups",
but is it possible to reveal the kind of nesting you deal with and at
which levels the bandwidth controls are set? Even something like "O(10)
cgroups on root with BW throttling set, each containing O(100) cgroups
below" would help in putting together a matching test setup.

[3] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf
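
For reference, something along the lines of the sketch below is what I
would use to build such a hierarchy on cgroup v2 for testing (the
layout, paths, and the 50ms/100ms quota are made-up values, purely for
illustration):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define CG_ROOT "/sys/fs/cgroup"

static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		exit(1);
	}
	close(fd);
}

int main(void)
{
	char path[256];
	int i, j;

	/* Assumes cgroup2 is mounted at CG_ROOT; enabling +cpu again
	 * is harmless if it is already enabled. */
	write_file(CG_ROOT "/cgroup.subtree_control", "+cpu");

	for (i = 0; i < 10; i++) {
		/* Top-level group with BW control: 50ms every 100ms. */
		snprintf(path, sizeof(path), CG_ROOT "/bw%d", i);
		mkdir(path, 0755);	/* ignoring EEXIST for brevity */
		snprintf(path, sizeof(path), CG_ROOT "/bw%d/cpu.max", i);
		write_file(path, "50000 100000");

		/* Enable the cpu controller for the children so each
		 * one gets a task_group of its own. */
		snprintf(path, sizeof(path),
			 CG_ROOT "/bw%d/cgroup.subtree_control", i);
		write_file(path, "+cpu");

		for (j = 0; j < 100; j++) {
			snprintf(path, sizeof(path),
				 CG_ROOT "/bw%d/child%d", i, j);
			mkdir(path, 0755);
		}
	}
	return 0;
}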


> In particular, throttle/unthrottle (whether it be on a group basis or
> a per-task basis) is a loop that is subject to a lot of cache misses.
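
Right, and the cache misses follow from the shape of the walk: every
throttle/unthrottle has to visit each group in the sub-hierarchy, so
the cost scales with the number of descendant cgroups. A simplified
standalone sketch of that shape (hypothetical types, not the actual
kernel/sched/fair.c code):

struct tg_sketch {
	struct tg_sketch *children;	/* first child */
	struct tg_sketch *sibling;	/* next sibling */
	int throttle_count;
};

/* Visit tg and all of its descendants, as throttle/unthrottle must. */
static void walk_subtree(struct tg_sketch *tg,
			 void (*visit)(struct tg_sketch *))
{
	struct tg_sketch *child;

	visit(tg);	/* each visit likely misses cache on a large tree */
	for (child = tg->children; child; child = child->sibling)
		walk_subtree(child, visit);
}

static void throttle_down(struct tg_sketch *tg)
{
	tg->throttle_count++;	/* plus dequeue/stat updates in reality */
}

/* e.g.: walk_subtree(some_tg, throttle_down); */

Breaking this up to apply the updates lazily, as you describe, would
amortize that walk instead of paying for all of it inside a single
throttle operation.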

> Best,
> Josh

--
Thanks and Regards,
Prateek