Re: [PATCH v2] sched: async unthrottling for cfs bandwidth

From: Josh Don
Date: Mon Oct 31 2022 - 17:22:59 EST


Hey Peter,


On Mon, Oct 31, 2022 at 6:04 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Oct 26, 2022 at 03:44:49PM -0700, Josh Don wrote:
> > CFS bandwidth currently distributes new runtime and unthrottles cfs_rq's
> > inline in an hrtimer callback. Runtime distribution is a per-cpu
> > operation, and unthrottling is a per-cgroup operation, since a tg walk
> > is required. On machines with a large number of cpus and large cgroup
> > hierarchies, this cpus*cgroups work can be too much to do in a single
> > hrtimer callback: since IRQs are disabled, hard lockups may easily occur.
> > Specifically, we've found this scalability issue on configurations with
> > 256 cpus, O(1000) cgroups in the hierarchy being throttled, and high
> > memory bandwidth usage.
> >
> > To fix this, we can instead unthrottle cfs_rq's asynchronously via a
> > CSD. Each cpu is responsible for unthrottling itself, thus sharding the
> > total work more fairly across the system, and avoiding hard lockups.
>
> So, TJ has been complaining about us throttling in kernel-space, causing
> grief when we also happen to hold a mutex or some other resource, and
> has been prodding us to throttle only at the return-to-user boundary.

Yea, we've been having similar priority inversion issues. It isn't
limited to CFS bandwidth though; such problems are also pretty easy to
hit with configurations of shares, cpumasks, and SCHED_IDLE. I've
chatted with the folks working on the proxy execution patch series,
and it seems like that could be a better generic solution to these
types of issues.

Throttling at return-to-user seems only mildly beneficial, and even
then only really with preemptive kernels. It is still pretty easy to
get inversion issues: e.g. a thread holding a kernel mutex wakes back
up into a hierarchy that is currently throttled, or a thread holding a
kernel mutex exists in the hierarchy being throttled but is currently
waiting to run.
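
For concreteness, the mechanics of that alternative would look
something like the sketch below. This is purely hypothetical and not
part of this patch; the cfs_throttle_work field and the
dequeue_task_for_throttle() helper are made-up names:

#include <linux/task_work.h>

/* Hypothetical: defer the actual dequeue to the next exit to
 * userspace instead of doing it the moment quota runs out. */
static void cfs_throttle_work_fn(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					     cfs_throttle_work);

	/* On the return-to-user path p holds no kernel mutexes, so
	 * this is a safe point to take it off the runqueue.
	 * (dequeue_task_for_throttle() is a made-up helper.) */
	dequeue_task_for_throttle(p);
}

static void throttle_task_at_user_boundary(struct task_struct *p)
{
	init_task_work(&p->cfs_throttle_work, cfs_throttle_work_fn);
	task_work_add(p, &p->cfs_throttle_work, TWA_RESUME);
}

As noted above, something like this only moves where the dequeue
happens; it doesn't close the inversion windows by itself.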

> Would this be an opportune moment to do this? That is, what if we
> replace this CSD with a task_work that's run on the return-to-user path
> instead?

The above comment is about when we throttle, whereas this patch is
about the unthrottle case. I think you're asking why we don't
unthrottle using e.g. a task_work assigned to whatever the current
task is. That would work around the issue of keeping IRQs disabled for
long periods, but it still forces one cpu to process everything, which
can take quite a while.
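
To make the sharding concrete, here is roughly the shape of the
mechanism (a simplified sketch; the field and function names are
illustrative rather than exactly what's in the patch): each throttled
cfs_rq is queued onto a per-rq list, and a CSD is fired at that cpu so
it can unthrottle its own list under its own rq lock.

/* Runs on the target cpu via smp_call_function_single_async(). */
static void __cfsb_csd_unthrottle(void *arg)
{
	struct rq *rq = arg;
	struct cfs_rq *cursor, *tmp;
	struct rq_flags rf;

	rq_lock(rq, &rf);

	/* Each cpu walks only its own list of throttled cfs_rq's. */
	list_for_each_entry_safe(cursor, tmp, &rq->cfsb_csd_list,
				 throttled_csd_list) {
		list_del_init(&cursor->throttled_csd_list);
		if (cfs_rq_throttled(cursor))
			unthrottle_cfs_rq(cursor);
	}

	rq_unlock(rq, &rf);
}

static void unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);
	bool first;

	lockdep_assert_rq_held(rq);

	/* The local cpu can just do the work directly. */
	if (rq == this_rq()) {
		unthrottle_cfs_rq(cfs_rq);
		return;
	}

	/* Arm the CSD only on the first enqueue, so each remote cpu
	 * gets at most one IPI per distribution round. */
	first = list_empty(&rq->cfsb_csd_list);
	list_add_tail(&cfs_rq->throttled_csd_list, &rq->cfsb_csd_list);
	if (first)
		smp_call_function_single_async(cpu_of(rq), &rq->cfsb_csd);
}

The distribution path then only hands out runtime and queues work; the
expensive tg walks happen concurrently on the cpus that own them
instead of serially with IRQs disabled in the hrtimer callback.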

Thanks,
Josh