Re: [PATCH v2 09/13] sched/qos: Add rampup multiplier QoS

From: Qais Yousef

Date: Tue May 12 2026 - 04:02:36 EST


On 05/11/26 13:03, Peter Zijlstra wrote:
> On Mon, May 04, 2026 at 02:59:59AM +0100, Qais Yousef wrote:
>
> > diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
> > index 0911261cb124..f68856f23b6b 100644
> > --- a/Documentation/scheduler/sched-qos.rst
> > +++ b/Documentation/scheduler/sched-qos.rst
> > @@ -42,3 +42,25 @@ need for extension will arise; and when this happen the task should be
> > simpler to add the kernel extension and allow userspace to use readily by
> > setting the newly added flag without having to update the whole of
> > sched_attr.
> > +
> > +2. QoS Tags
> > +===========
> > +
> > +SCHED_QOS_RAMPUP_MULTIPLIER
> > +---------------------------
> > +
> > +Controls how fast util signal rises. Affects frequency selection when schedutil
> > +is in use. And affects how fast tasks migrate between clusters on HMP systems.
> > +
> > +It affects bursty tasks only. Perfectly periodic tasks are well described by
> > +util_avg and the rampup multiplier will have no effect on them.
> > +
> > +When set to 0, util_est will be disabled to help further with power saving.
> > +This behavior can be controlled via UTIL_EST_RAMPUP_ZERO sched_feature.
> > +
> > +Value is not capped to retain flexibility, but it tapers off very quickly to
> > +notice a difference above 16. Roughly it takes ~200ms to reach a util_avg of
> > +1000 starting from 0. With 16 it should take ~12.5ms. A range of 0-8 is
> > +advised for general use.
> > +
> > +Cookie must always be set to 0.
>
> So this is a very specific feature. This is made possible by basically
> having a huge type space, allowing for throw-away hints (as per the
> previous email).

Hmm. It is both specific and generic. It is specific in the sense that it is
about the rise time through performance levels and the scheduler's integration
with schedutil. It is also generic because it is about the time it takes the
scheduler/kernel to move through performance levels. I could change the
description to focus on these generic elements: DVFS response time and
migration time for HMP systems.

I think if we move away from PELT etc, the concept will still be valid but
implemented differently, unless the new implementation for some reason can't
use the concept of a multiplier to speed up the rise time.
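
As a rough illustration of the multiplier idea (my own simplified model, not
the code in this series): assume a PELT-style signal with a 32ms half-life, an
always-running task, and a multiplier that simply makes each millisecond of
wall time count as that many PELT periods. The function name, the 1ms step and
the decay constant below are assumptions made for the sketch.

```c
/* Simplified model, not kernel code. PELT_Y is chosen so PELT_Y^32 ~= 0.5. */
#define PELT_Y		0.97857206
#define UTIL_MAX	1024.0
#define MAX_STEPS	100000

/*
 * Return the wall-clock milliseconds an always-running task needs to ramp
 * from 0 to at least 'target', or -1 if this model cannot answer.
 * Assumption: each wall-clock ms counts as 'mult' PELT periods.
 */
int rise_time_ms(unsigned int mult, double target)
{
	double util = 0.0;
	int ms;

	/* mult == 0 has its own meaning in the patch; not modelled here. */
	if (mult == 0 || target >= UTIL_MAX)
		return -1;

	for (ms = 0; util < target; ms++) {
		unsigned int i;

		if (ms >= MAX_STEPS)
			return -1;

		/* Geometric accumulation towards UTIL_MAX, 'mult' times per ms. */
		for (i = 0; i < mult; i++)
			util = util * PELT_Y + UTIL_MAX * (1.0 - PELT_Y);
	}

	return ms;
}
```

With a target of 1000, a multiplier of 1 lands in the same ballpark as the
~200ms quoted in the documentation hunk above, and a multiplier of 16 shortens
that by roughly the same factor. A different signal would need a different
loop body, but the "scale the rise time" interface could stay the same.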

>
> I suppose having these specific hints is easy, but as per always there
> is the discussion about describing task behaviour vs implementation
> details. With the argument being that task behaviour might be a more
> lasting / stable hint, while implementation details are far easier to
> actually do.
>
> I'm missing this discussion.

The intention is to describe task behavior. But I also want to be practical and
allow solving real world problems with ease - so if describing an
implementation detail helps us fix problems simply and easily, then I am for it.

The question is how to protect ourselves? :-)

This is where the two levels of QoS can help.

One level is for app developers, which is a high level abstraction that is
detached from OS internals and details. This is done in schedqos, which
I announced recently. The goal is for users to use the QoS exposed by this
service and not to interact directly with the scheduler/kernel.

The other level is the one proposed here, which is to enable this smart
service to provide a meaningful abstraction for end users without being used
by them directly - and we can define it however we like.

And this brings us to a contentious point: how to protect and enforce this
behavior?

I think we need to enforce that these hints are used by some all-knowing entity
and for sched_attr to be locked down for everyone except it. Vincent was
suggesting to use SELinux to lock down sched_attr, but given recent issues with
tcmalloc I think we must enforce something at kernel level. CAP_NICE is spread
around and we don't want to mix and match how sched_attr and these new QoS are
used.

To address this I think we need to introduce a new CAP_PERF_MANAGER (or pick
your favourite name here) that can only be set for specific binaries and only
one binary is allowed to exec with this capability. If two binaries with this
capability try to run, then the second one will fail unless the first one has
exited first. And when it is running, we lock down sched_setattr() except for
this CAP_PERF_MANAGER.
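
To make the intended semantics concrete, here is a minimal userspace model of
the rule (purely hypothetical: CAP_PERF_MANAGER does not exist, and the
function names are made up for this sketch). It ignores locking, namespaces
and how the capability gets attached to a binary; it only shows the
"single holder, everyone else locked out while it runs" behavior.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical: pid of the one task holding CAP_PERF_MANAGER, 0 if none. */
pid_t perf_manager_pid;

/* Exec-time check for a binary carrying the capability. */
bool perf_manager_claim(pid_t pid)
{
	/* A second manager fails while the first one is still running. */
	if (perf_manager_pid != 0 && perf_manager_pid != pid)
		return false;

	perf_manager_pid = pid;
	return true;
}

/* Exit-time hook: only the current holder can give the slot back. */
void perf_manager_release(pid_t pid)
{
	if (perf_manager_pid == pid)
		perf_manager_pid = 0;
}

/*
 * Permission check for sched_setattr(): open to everyone while no
 * manager runs, restricted to the manager once one is running.
 */
bool may_sched_setattr(pid_t caller)
{
	return perf_manager_pid == 0 || caller == perf_manager_pid;
}
```

A real implementation would have to make the claim atomic and decide what
happens to attributes set before the manager started.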

I am not sure if this is enough, but I think we must enforce the usage pattern
or else we can end up with a mess. I think we all agree, with the benefit of
hindsight, that it is hard for applications to use sched_attr directly in
general. I commonly see even the simple nice value misused in practice, for
example.

Ideally I'd love to enforce a single trusted binary if that can be done :p