Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

From: Peter Zijlstra
Date: Mon Oct 31 2022 - 09:04:38 EST


On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don wrote:
> CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's
> inline in an hrtimer callback. Runtime distribution is a per-cpu
> operation, and unthrottling is a per-cgroup operation, since a tg walk
> is required. On machines with a large number of cpus and large cgroup
> hierarchies, this cpus*cgroups work can be too much to do in a single
> hrtimer callback: since IRQs are disabled, hard lockups may easily occur.
> Specifically, we've found this scalability issue on configurations with
> 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high
> memory bandwidth usage.
>
> To fix this, we can instead unthrottle cfs_rq's asynchronously via a
> CSD. Each cpu is responsible for unthrottling itself, thus sharding the
> total work more fairly across the system, and avoiding hard lockups.

So, TJ has been complaining about us throttling in kernel-space, which
causes grief when we also happen to hold a mutex or some other resource,
and has been prodding us to only throttle at the return-to-user boundary.

Would this be an opportune moment to do this? That is, what if we
replace this CSD with a task_work that's run on the return-to-user path
instead?