Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

From: Benjamin Segall
Date: Mon Oct 31 2022 - 17:56:22 EST


Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:

> On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don wrote:
>> CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's
>> inline in an hrtimer callback. Runtime distribution is a per-cpu
>> operation, and unthrottling is a per-cgroup operation, since a tg walk
>> is required. On machines with a large number of cpus and large cgroup
>> hierarchies, this cpus*cgroups work can be too much to do in a single
>> hrtimer callback: since IRQ are disabled, hard lockups may easily occur.
>> Specifically, we've found this scalability issue on configurations with
>> 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high
>> memory bandwidth usage.
>>
>> To fix this, we can instead unthrottle cfs_rq's asynchronously via a
>> CSD. Each cpu is responsible for unthrottling itself, thus sharding the
>> total work more fairly across the system, and avoiding hard lockups.
>
> So, TJ has been complaining about us throttling in kernel-space, causing
> grief when we also happen to hold a mutex or some other resource and has
> been prodding us to only throttle at the return-to-user boundary.
>
> Would this be an opportune moment to do this? That is, what if we
> replace this CSD with a task_work that's run on the return-to-user path
> instead?

This patch is about unthrottle, not throttle, but it would probably be
straightforward enough to do what you said for throttle. I'd expect that
not to help all that much though, because throttle hits the entire
cfs_rq, not individual threads.

I'm currently trying something more invasive, which doesn't throttle a
cfs_rq while it has any kernel tasks, and prioritizes kernel tasks / ses
containing kernel tasks when a cfs_rq "should" be throttled. "Invasive"
is a key word though, as it needs to do the sort of h_nr_kernel_tasks
tracking on put_prev/set_next in ways we currently only need to do on
enqueue/dequeue.
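
To make the shape of that tracking concrete, here is a rough userspace toy
model, not kernel code and not taken from any patch. Every name in it other
than h_nr_kernel_tasks (toy_cfs_rq, toy_account_kernel_task,
toy_try_throttle, runtime_remaining as used here) is hypothetical. It only
illustrates the two pieces described above: a hierarchical count that has to
be updated whenever a task enters or leaves kernel mode, and a throttle
decision that is deferred while that count is nonzero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a cfs_rq; this is NOT the kernel's struct cfs_rq. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;  /* NULL at the root of the hierarchy */
	int h_nr_kernel_tasks;      /* tasks in kernel mode at or below this level */
	long runtime_remaining;     /* <= 0 means the quota is exhausted */
	bool throttled;
};

/*
 * Hypothetical helper: propagate a change in the number of kernel-mode
 * tasks from rq up to the root. delta is +1 when a task enters kernel
 * mode and -1 when it leaves.
 */
static void toy_account_kernel_task(struct toy_cfs_rq *rq, int delta)
{
	for (; rq; rq = rq->parent) {
		rq->h_nr_kernel_tasks += delta;
		assert(rq->h_nr_kernel_tasks >= 0);
	}
}

/*
 * Hypothetical helper: throttle rq only if its quota is exhausted and no
 * task at or below it is currently in kernel mode. Returns true if the
 * rq was throttled, false if throttling was not needed or was deferred.
 */
static bool toy_try_throttle(struct toy_cfs_rq *rq)
{
	if (rq->runtime_remaining > 0)
		return false;	/* still has quota */
	if (rq->h_nr_kernel_tasks > 0)
		return false;	/* deferred: a kernel task may hold a resource */
	rq->throttled = true;
	return true;
}
```

In this model a child with exhausted quota stays runnable while any of its
tasks is in kernel mode, and gets throttled on the next attempt after the
count drops back to zero. The real cost is that the count must be adjusted
on every user/kernel transition of a running task, which is what makes the
approach invasive.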