Re: [RFC PATCH 0/6] sched: uclamp sum aggregation

From: Hongyan Xia
Date: Sun Dec 03 2023 - 20:48:49 EST


Hi Qais,

On 03/12/2023 00:25, Qais Yousef wrote:
Hi Hongyan

On 10/04/23 10:04, Hongyan Xia wrote:
Current implementation of uclamp leaves many things to be desired.
There are several problems:

1. Max aggregation is fragile. A single task with a high UCLAMP_MAX (or
with the default value, which is 1024) can ruin the previous
settings the moment it is enqueued, as shown in the uclamp frequency
spike problem in Section 5.2 of
Documentation/scheduler/sched-util-clamp.rst. Constantly running at
1024 utilization implies that the CPU is driven at its max capacity.
However, with UCLAMP_MAX, this assumption is no longer true. To
mitigate this, one idea is to filter out uclamp values for
short-running tasks. However, the effectiveness of this idea remains
to be seen.

This was proved when I was at arm and it is used in products now too. It is
effective.


2. No way to differentiate between UCLAMP_MAX throttled CPU or a CPU
running at its peak, as both show utilization of 1024. However, in
certain cases it's important to tell the difference, as we still want
to take the opportunity to enqueue tasks in the former case.

It's also debatable whether uclamp should be a frequency hint. An

It is not debatable. Uclamp is a performance hint which translates into task
placement (for HMP) and frequency selection. On SMP the latter is the only
meaning for uclamp. For the time being at least until we get a new technology
that can impact performance ;-)

I agree. In this new series, uclamp is still a performance hint which affects task placement and frequency, so we don't have a disagreement here.

example is a system with a mid CPU of capacity 500, and a little CPU
with capacity 100. If we have 10 tasks with UCLAMP_MIN of 101, then
util_fits_cpu() will schedule them on the mid CPU because feec()
thinks they don't fit on the little, even though we should use the
little core at some point instead of making the mid even more crowded.

This is not a problem. One of the major design points for uclamp is to allow
small tasks to request to run faster. So if you have 100 tasks that run for
100us and to meet that they need to run at mid CPU that is fine. That doesn't
necessarily mean they'll overutilized the cluster.

I understand. This design point is what sum aggregation does as well. This alone is not a problem, but moving a CPU to another CPU to fit uclamp_min requirement can be a problem, when the destination CPU is crowded and mixed with uclamp_max. I will elaborate on this below.


If the cluster gets overutilized, then EAS will be disabled and
find_idle_capacity() will spread into highest spare capacity idle CPU, which
includes little CPUs. And this behavior is no different to have tasks with util
greater than 100 wanting to run on mid cores. The only difference is that we
can pack more on mid core before overutilized can be triggered. Which is the
desired effect.

So this is a no issue and we should spread once the cluster is busy.
Of course, this in reality doesn't happen because of other mechanisms
like load balancing, but it's also not good when other mechanisms can
cancel the effect of uclamp in feec(). A CPU running at capacity 500 for
1ms or for 1000ms gives completely different performance levels, so
trying to fit only the frequency does not give us any performance
guarantees. If we then talk about not only running at some frequency but
also for some amount of time, then what we really mean is a capacity
hint, not a frequency hint.

I'm not sure I parsed that. But again, it's a performance hint which implies
both.


It's even worse when the tasks scheduled on the mid CPU also have
UCLAMP_MAX values. In the previous example, instead of running at 500, a
single UCLAMP_MAX, assuming it's 110, can make the mid CPU run at a much
lower frequency than 500, so it is then a really bad decision to honor
the UCLAMP_MIN frequency hint and place it on the mid CPU, instead of
running it on the little CPU which is less crowded, and has pretty much
the same capacity (100 vs 110) anyway.

I can't follow this reasoning.

It seems you're assuming the user has picked 110 by mistake and it is okay to
ignore it.

FYI, one of the use cases is to set uclamp_min to capacity_of(little_core)+1
because they're so crappy to handle their workload. It is still better than
using affinity as when the system gets busy we can still use the little core.
But only when the system is busy and crowded already.

I don't know what's the reference to the mid core having a task with
uclamp_max, but seems a bug in feec() if not predicting the frequency
correctly.

The example in the cover letter might be complex. I can simplify as a "N tasks" problem.

Say 2 CPUs, one with capacity 1024 and the other 512. Assume there's a task with uclamp_min 513 and uclamp_max of 800, then the big CPU can, in theory, fit an infinite number of such tasks on the big CPU, without triggering overutilization. What does uclamp_min even mean when this happens? uclamp_min should be a hint of a minimum performance it should get, but it breaks down when there are N tasks, so the hint from uclamp_min is divided by N. Under max aggregation, uclamp_min doesn't give you the "performance" of this task. It gives performance of the CPU, and this is not the performance of this task when N > 1. In fact, N can be infinity. This is the same problem shared by multiple uclamp users from LPC, where max aggregation hints the performance of the CPU, but not the task itself.

As to "FYI, one of the use cases is to set uclamp_min to capacity_of(little_core)+1 because they're so crappy to handle their workload", I'm afraid I disagree with this. I think uclamp is a simple hinting mechanism to the scheduler to say: please try to give me [uclamp_min, uclamp_max] performance. It's then the job of EAS to pick the most efficient CPU. We know the little CPU is crappy because we have deep knowledge of the device, but uclamp values could just come from game developers (using ADPF) who have no knowledge of the SoC, and in the game developer's view, asking uclamp of little CPU + 1 is simply because he/she wants that performance, not because the developer thinks the little CPU should be avoided. I don't think a soft-affinity-like property belongs here. We should improve EM and let EAS do its job instead of teaching it to interpret uclamp as affinity hints.


This series attempts to solve these problems by tracking a
util_avg_uclamp signal in tasks and cfs_rq. At task level,

This was discussed offline and brought several times on the list already.
uclamp is a performance not a bandwidth hint. I remember Dietmar was suggesting
this concept in the past and I objected that this converts uclamp into
a bandwidth hint. This overlaps with CFS bandwidth controller. We can use that
if somebody wants to limit the bandwidth the task can consume. uclamp is about
limiting the performance level it can reach. ie: it is not about cycles
consumed, but Performance Point reached.

I'd argue the current max aggregation implementation is further away from a performance hint. The moment there are more tasks interacting with each other on the same rq, the performance hinting ability of max aggregation tends to do worse, like uclamp_min being divided by N, frequency spikes, uneven task distribution (both shown in the evaluation section of my cover letter) etc, whereas sum aggregation still performs well.

Other shortcomings are not that critical, but the fact that uclamp_min's effectiveness is divided by N under max aggregation I think is not acceptable.

p->se.avg.util_avg_uclamp is basically tracking the normal util_avg, but
clamped within its uclamp min and max. At cfs_rq level, util_avg_uclamp
must always be the sum of all util_avg_uclamp of all the entities on
this cfs_rq. As a result, cfs_rq->avg.util_avg_uclamp is the sum
aggregation of all the clamped values, which indicates the frequency
this rq should run at and what the utilization is.

This is a no go for me. A task performance requirements can not be added to
resemble the 'size' of the tasks. Tasks with util_avg 100 but a uclamp_min of
1024 should allow us to pack them in the big core. Which is an important use
case to use. A bunch of small tasks that need to run faster are okay to pack on
a bigger core if they need to run faster as long as this doesn't cause
overutilized to trigger.

I see. I now understand where we start to diverge. It seems we want completely different things from uclamp.

Max aggregation: please try to place me on a CPU that can give CPU capacity [uclamp_min, uclamp_max] and run that CPU within that range.
Sum aggregation: please try to treat me like a task with utilization within [uclamp_min, uclamp_max].

Like I mentioned in the previous comment, I'm afraid I disagree with using uclamp as a CPU affinity hint, even if it's just a soft affinity. Its job should just be a performance hint and EM should tell EAS which CPU to pick under that hint. If the little CPU is crappy, then the crappiness should be reflected in the EM, and maybe dynamic EM can help.


This idea solves Problem 1 by capping the utilization of an
always-running task throttled by UCLAMP_MAX. Although the task (denoted
by Task A) has no idle time, the util_avg_uclamp signal gives its
UCLAMP_MAX value instead of 1024, so even if another task (Task B) with
a UCLAMP_MAX value at 1024 joins the rq, the util_avg_uclamp is A's
UCLAMP_MAX plus B's utilization, instead of 1024 plus B's utilization,
which means we no longer have the frequency spike problem caused by B.
This should mean that we might completely avoid the need for uclamp
filtering.

It also solves Problem 2 by tracking the capped utilization of a rq.
Using util_avg_uclamp, we are able to differentiate between a CPU
throttled by UCLAMP_MAX and one that is actually running at its peak
capacity. An always-running rq with a task having UCLAMP_MAX at 300 will
report a cfs_rq util_avg_uclamp at 300, not 1024, which means task
placement sees spare capacity on this CPU and will allocate some tasks
to it, instead of treating it the same way as a CPU actually running at

We don't need to know the actual spare capacity, though that would be nice to
have.

But generally in my view this CPU should be available for non capped task to
run on. I view this a load balancer problem where wake up path can consider
this CPU as a candidate and rely on load balancer to move this task away to
another (smaller) CPU ASAP. If this capped task is already on the smaller CPU
and there's nowhere else to place the uncapped task in the system then we must
be quite busy already.

I'm not quite sure relying on the load balancer is a good idea, because then it means we need uclamp code in the load balancer and the complexity keeps growing. An example is that after applying the patch

'sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0'

even though the scheduler now knows that CPUs with 0 spare capacity are now candidates, it creates a very unbalanced task placement and CPU0 tends to receive a lot of tasks in my tests. This certainly needs some uclamp-aware code in the load balancer, whereas sum aggregation has no such problem and balance is achieved during task placement in the first place.

If userspace ends up capping too many tasks too hard that we end up with no
idle time in the system (where otherwise we should) I consider this a bad usage
in the userspace and will ask them not to cap that hard. This is a case of one
shooting themselves in the foot. Maybe we need to care in the future, but so
far this doesn't seem a valid problem to address. I'd like to see why userspace
has failed to do proper management first before jumping to fix this.

What I'd expect more use cases of tasks that are truly 1024 but being capped.
In this case there's truly no idle time on the CPU and your proposal won't
address this. I don't think we have a way to distinguish between the two cases
at the moment.

I don't quite get what you mean. In the examples in my cover letter the capped tasks are truly 1024. Two CPUs, one with an uncapped 1024 task and the other a 1024 task but with uclamp_max of 300. The scheduler sees cfs_rq->avg.util_avg_uclamp of 1024 in the first CPU and 300 in the 2nd, so my proposal indeed sees the difference, and will have the opportunity to place more tasks on the 2nd CPU without any corner case code.

the peak. This brings us closer to the benefits of Android's sum
aggregation (see code related to CONFIG_USE_GROUP_THROTTLE at
https://android.googlesource.com/kernel/gs/+/refs/heads/android-gs-raviole-5.10-android12-d1/drivers/soc/google/vh/kernel/sched/fair.c#510),
which shows energy benefits because we are able to schedule tasks on
smaller cores which are UCLAMP_MAX capped, instead of finding a fresh

You're jumping to the wrong conclusion about why this is being used. This was
to overcome issues when using cpu.shares. This has nothing to do with
UCLAMP_MAX. uclamp_max filter is what is being used. Based on the proposal
I posted long ago

https://android-review.googlesource.com/c/kernel/common/+/1914696


Maybe I need to rephrase. I said that GROUP_THROTTLE can give us some energy benefits. I didn't draw any conclusion that this was why it was invented. I was also trying to suggest that, under sum aggregation, the same code can work for both uclamp and GROUP_THROTTLE. However, I probably need to better understand the issues around cpu.shares before saying this concretely.

Also, uclamp filtering might be another example why I think max aggregation is over-complicated. Sum aggregation has only around 300 lines of code so far without the need for filtering at all.

big CPU. However, a major difference is that this series has an
amortized cost of O(1), instead of O(n) in cpu_util_next() in Android
kernels.

It avoids the shortcomings from uclamp-being-a-frequency-hint. Under sum

This is not a short coming. Performance implies frequency. We should not
confused how schedutil works to estimate perf requirement based on historical
run time (util_avg) and the fact that a task with short historical runtime
needs to run faster. Running faster means increasing frequency.

aggregation, the scheduler sees the sum aggregated util_avg_uclamp and
will avoid the problem of enqueueing too many tasks just to fit the
frequency, and will place tasks on little CPUs in the previous example.

I think the issue of uclamp being a performance but not a bandwidth hint was
raised several times. Did I miss something that makes this not converting it to
a bandwidth hint?


I wonder what does performance hint even mean if it doesn't at least correlate to bandwidth in some way? Yes, specifying uclamp_min under max aggregation can run the CPU at least at that frequency, but if you have N tasks on the same rq, then each task can only get a share of uclamp_min/N. I highly doubt that when a mobile phone developer increases the uclamp_min of an application, he or she is okay with doubling the CPU frequency when it runs and running it only for 1/10 of the time because it gets moved to a crowded big CPU, which essentially gives 0.2x the "performance" in my opinion, so I'm not sure fully disconnecting bandwidth from performance is meaningful. Of course, a (not yet) uclamp-aware load balancer could come in and migrate those tasks, but why not avoid this problem in the first place via sum aggregation without involving the load balancer, especially when today we don't have any indication of how well a uclamp-aware load balancer can perform anyway?

In the end I may still not be able to fully grasp what the issues of sum aggregations are. I think so far I get that it's closer to "bandwidth" and not "performance" (which I think "performance" doesn't mean much if it doesn't correlate to bandwidth) and it doesn't work as a CPU affinity hint (it actually does to some degree, but the uclamp values need to be different). But I think focusing on these ignores the whole picture, which are the benefits of sum aggregation (already presented in the evaluation section of the cover letter of this series):

1. It's fewer than 350 lines of code (max aggregation is 750+ and still growing).
2. It doesn't suffer from frequency spikes or load balancing issues.
3. It needs no code to consider uclamp as a corner case in the scheduler. No need to carry around uclamp_min and uclamp_max in so many functions. util_fits_cpu() is just one line of code.
4. It's effective. When you say a uclamp_min of X, the scheduler really tries to give you X for your task, not for the CPU you are on, as seen in 54% reduction of jank in jankbench, without increasing energy consumption.
5. It looks like something multiple developers want from LPC 2023. The use cases of "additive uclamp", and hinting the host from the VM via util_guest can be achieved exactly by uclamp sum aggregation.

Given these benefits, I think it's probably a good idea to give sum aggregation a serious look. On my Pixel 6, sum aggregation either gives better benchmark scores at the same energy as max aggregation, or consumes less power at the same benchmark scores. I'll keep benchmarking it but so far so good.

Hongyan