Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based throttle

From: Aaron Lu
Date: Thu Mar 27 2025 - 23:12:11 EST


On Thu, Mar 27, 2025 at 05:11:42PM -0700, Xi Wang wrote:
> On Tue, Mar 25, 2025 at 3:02 AM Aaron Lu <ziqianlu@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Mar 24, 2025 at 04:58:22PM +0800, Aaron Lu wrote:
> > > On Thu, Mar 20, 2025 at 11:40:11AM -0700, Xi Wang wrote:
> > > ...
> > > > I am a bit unsure about the overhead experiment results. Maybe we can add some
> > > > counters to check how many cgroups per cpu are actually touched and how many
> > > > threads are actually dequeued / enqueued for throttling / unthrottling?
> > >
> > > Sure thing.
> > >
> > > > Looks like busy loop workloads were used for the experiment. With throttling
> > > > deferred to exit_to_user_mode, it would only be triggered by ticks. A large
> > > > runtime debt can accumulate before the on cpu threads are actually dequeued.
> > > > (Also noted in https://lore.kernel.org/lkml/20240711130004.2157737-11-vschneid@xxxxxxxxxx/)
> > > >
> > > > distribute_cfs_runtime would finish early if the quotas are used up by the first
> > > > few cpus, which would also result in throttling/unthrottling for only a few
> > > > runqueues per period. An intermittent workload like hackbench may give us more
> > > > information.
> > >
> > > I've added some trace prints and noticed it already involved almost all
> > > cpu rqs on that 2sockets/384cpus test system, so I suppose it's OK to
> > > continue using that setup as described before:
> > > https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@xxxxxxxxxxxxxx/
> >
> > One more data point that might be interesting. I've tested this on a
> > v5.15 based kernel where async unthrottle is not available yet so things
> > should be worse.
> >
> > As Xi mentioned, since the test program is cpu hog, I tweaked the quota
> > setting to make throttle happen more likely.
> >
> > The bpftrace duration of distribute_cfs_runtime() is:
> >
> > @durations:
> > [4K, 8K) 1 | |
> > [8K, 16K) 8 | |
> > [16K, 32K) 1 | |
> > [32K, 64K) 0 | |
> > [64K, 128K) 0 | |
> > [128K, 256K) 0 | |
> > [256K, 512K) 0 | |
> > [512K, 1M) 0 | |
> > [1M, 2M) 0 | |
> > [2M, 4M) 0 | |
> > [4M, 8M) 0 | |
> > [8M, 16M) 0 | |
> > [16M, 32M) 0 | |
> > [32M, 64M) 376 |@@@@@@@@@@@@@@@@@@@@@@@ |
> > [64M, 128M) 824 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> >
> > One random trace point from the trace prints is:
> >
> > <idle>-0 [117] d.h1. 83206.734588: distribute_cfs_runtime: cpu117: begins
> > <idle>-0 [117] dnh1. 83206.801902: distribute_cfs_runtime: cpu117: finishes: unthrottled_rqs=384, unthrottled_cfs_rq=422784, unthrottled_task=10000
> >
> > So for the above trace point, distribute_cfs_runtime() unthrottled 384
> > rqs with a total of 422784 cfs_rqs and enqueued back 10000 tasks, this
> > took about 70ms.
> >
> > Note that other things like rq lock contention might make things worse -
> > I did not notice any lock contention in this setup.
> >
> > I've attached the corresponding debug diff in case it's not clear what
> > this trace print means.
>
> Thanks for getting the test results!
>
> My understanding is that you now have 2 test configurations and new vs
> old throttling mechanisms. I think the two groups of results were
> test1 + new method and test2 + old method. Is that the case?

Sorry for the confusion.

The first result was done using this patch series on top of the latest
tip/sched/core branch, which has the async unthrottle feature. The second
result was done using this patch series (adjusted to run on an old kernel,
of course) on top of a v5.15 based kernel where async unthrottle is not
available yet.

>
> For test1 + new method, we have "..in distribute_cfs_runtime(), 383
> rqs are involved and the local cpu has unthrottled 1101 cfs_rqs and a
> total of 69 tasks are enqueued back". I think if the workload is in a
> steady and persistently over limit state we'd have 1000+ tasks
> periodically being throttled and unthrottled, at least with the old
> method. So "1101 cfs_rqs and a total of 69 tasks are enqueued back"
> might be a special case?

With async unthrottle, distribute_cfs_runtime() only deals with the
local cpu's unthrottle; other cpus' unthrottles are done by sending them
an IPI so that each cpu handles its own unthrottle.

Since there are a total of 2000 leaf cfs_rqs and 20K tasks on this 384
cpus machine, each cpu should have roughly 52 tasks, and one top level
hierarchy has about 1000 leaf cfs_rqs. That's why 1101 cfs_rqs were
unthrottled and 69 tasks were enqueued: on the specific cpu where the
hrtimer fired, distribute_cfs_runtime() alone iterated 1101 cfs_rqs and
enqueued back 69 tasks. All other tasks are enqueued in the other CPUs'
IPI handler __cfsb_csd_unthrottle(). With this said, for this setup, "1101
cfs_rqs and 69 tasks" is not a special case but a worse than normal case,
if not the worst.

When the async unthrottle feature is not available (as in the 2nd
result), distribute_cfs_runtime() has to iterate all cpus' throttled
cfs_rqs and enqueue back all of their throttled tasks. Since these
numbers are much larger, I showed them for the sole purpose of
demonstrating how bad the duration can be when 10K tasks have to be
enqueued back in one go.

> I'll also try to create a stress test that mimics our production
> problems but it would take some time.

That would be good to have, thanks.