Re: [PATCH v2 09/13] sched/qos: Add rampup multiplier QoS

From: Qais Yousef

Date: Tue May 12 2026 - 04:56:02 EST


On 05/12/26 09:37, Christian Loehle wrote:
> On 5/12/26 08:59, Qais Yousef wrote:
> > On 05/11/26 13:03, Peter Zijlstra wrote:
> >> On Mon, May 04, 2026 at 02:59:59AM +0100, Qais Yousef wrote:
> >>
> >>> diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
> >>> index 0911261cb124..f68856f23b6b 100644
> >>> --- a/Documentation/scheduler/sched-qos.rst
> >>> +++ b/Documentation/scheduler/sched-qos.rst
> >>> @@ -42,3 +42,25 @@ need for extension will arise; and when this happen the task should be
> >>> simpler to add the kernel extension and allow userspace to use readily by
> >>> setting the newly added flag without having to update the whole of
> >>> sched_attr.
> >>> +
> >>> +2. QoS Tags
> >>> +===========
> >>> +
> >>> +SCHED_QOS_RAMPUP_MULTIPLIER
> >>> +---------------------------
> >>> +
> >>> +Controls how fast util signal rises. Affects frequency selection when schedutil
> >>> +is in use. And affects how fast tasks migrate between clusters on HMP systems.
> >>> +
> >>> +It affects bursty tasks only. Perfectly periodic tasks are well described by
> >>> +util_avg and the rampup multiplier will have no effect on them.
> >>> +
> >>> +When set to 0, util_est will be disabled to help further with power saving.
> >>> +This behavior can be controlled via UTIL_EST_RAMPUP_ZERO sched_feature.
> >>> +
> >>> +Value is not capped to retain flexibility, but it tapers off very quickly to
> >>> +notice a difference above 16. Roughly it takes ~200ms to reach a util_avg of
> >>> +1000 starting from 0. With 16 it should take ~12.5ms. A range of 0-8 is
> >>> +advised for general use.
> >>> +
> >>> +Cookie must always be set to 0.
> >>
> >> So this is a very specific feature. This is made possible by basically
> >> having a huge type space, allowing for throw-away hints (as per the
> >> previous email).
> >
> > Hmm. It is both specific and generic. It is specific in the sense that it is
> > about the rise time through performance levels and scheduler integration with
> > schedutil. It is also generic because it is about the time it takes the
> > scheduler/kernel to move through performance levels. I could change the
> > description to focus on these generic elements of DVFS response time and
> > migration time for HMP systems.
> >
> > I think if we move away from PELT etc, the concept will still be valid but
> > implemented differently unless the new implementation can't use the concept of
> > a multiplier for some reason to speed up the rise time.
> >
> >>
> >> I suppose having these specific hints is easy, but as per always there
> >> is the discussion about describing task behaviour vs implementation
> >> details. With the argument being that task behaviour might be a more
> >> lasting / stable hint, while implementation details are far easier to
> >> actually do.
> >>
> >> I'm missing this discussion.
> >
> > The intention is to describe task behavior, but also to be practical and
> > allow real-world problems to be solved with ease - so if an implementation
> > detail description helps us fix problems simply and easily, then I am for it.
> >
> > The question is how to protect ourselves? :-)
> >
> > This is where the two levels of QoS can help.
> >
> > One level is for app developers, which is a high-level abstraction that is
> > detached from OS internals and details. This is done in schedqos, which
> > I announced recently. The goal is for users to use the QoS exposed by this
> > service and not to interact directly with the scheduler/kernel.
> >
> > The other level is this one proposed here; which is to enable this smart
> > service to provide a meaningful abstraction for end users, but not directly
> > being used by them - and we can define it whatever we like.
> >
> > And this brings us to a contentious point, how to protect and enforce this
> > behavior?
> >
> > I think we need to enforce that these hints are used by some all-knowing entity
> > and for sched_attr to be locked down for everyone except it. Vincent was
> > suggesting to use SELinux to lock down sched_attrs, but given recent issues with
> > tcmalloc I think we must enforce something at kernel level. CAP_NICE is spread
> > around and we don't want to mix and match how sched_attr and these new QoS are
> > used.
> >
> > To address this I think we need to introduce a new CAP_PERF_MANAGER (or pick
> > your favourite name here) that can only be set for specific binaries and only
> > one binary is allowed to exec with this capability. If two binaries with this
> > capability try to run, then the second one will fail unless the first one has
> > exited first. And when it is running, we lock down sched_setattr() except for
> > this CAP_PERF_MANAGER.
> >
> > I am not sure if this is enough, but I think we must enforce the usage pattern
> > else we can end up with a mess. I think we all agree, with the benefit of
> > hindsight, that it is hard for applications to use sched_attr directly in
> > general. I commonly see the simple nice value misused in practice, for
> > example.
> >
> > Ideally I'd love to enforce a single trusted binary if that can be done :p
> >
>
>
> Just to follow along, does that mean if an application runs with CAP_PERF_MANAGER
> any other that doesn't have CAP_PERF_MANAGER and calls any of
> sched_setattr()
> sched_setscheduler()
> sched_setparam()
> nice()
> setpriority()
>
> would get EPERM? Or silently be dropped?

I think EPERM.

> Either seems error-prone and would potentially no longer work as a "Zero API adoption mechanism".

Why is it error-prone? Could you explain more?

What's the relationship between this and zero API adoption?

> Chromium and Unity seem to handle sched_setattr() failing, but unsure what the
> situation looks like generally.

Do you think they will crash if we block usage? These calls can already fail
if admins block them via SELinux, for example.
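
To make the failure mode concrete, here is a minimal userspace sketch of how
an application could treat a denied priority change as non-fatal. It is only
an illustration under stated assumptions: the names hint_result,
classify_hint_errno() and try_set_nice() are hypothetical and not taken from
any of the projects mentioned, and it uses setpriority() as a stand-in for the
other calls listed above. Whether a future lockdown would return exactly EPERM
is still open; the sketch treats EPERM and EACCES the same way.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Hypothetical result type: not part of any existing API. */
enum hint_result { HINT_APPLIED, HINT_DENIED, HINT_ERROR };

/*
 * Map an errno value to a policy decision. Permission failures are
 * expected when an admin (or a future lockdown) blocks the call, so
 * they are reported as "denied" rather than as a hard error.
 */
enum hint_result classify_hint_errno(int err)
{
	switch (err) {
	case 0:
		return HINT_APPLIED;
	case EPERM:
	case EACCES:
		return HINT_DENIED;
	default:
		return HINT_ERROR;
	}
}

/*
 * Try to change the nice value of the calling process. A denial is
 * logged and the caller is expected to carry on with default priority.
 */
enum hint_result try_set_nice(int nice_value)
{
	enum hint_result res;
	int err;

	if (setpriority(PRIO_PROCESS, 0, nice_value) == 0)
		return HINT_APPLIED;

	err = errno;
	res = classify_hint_errno(err);
	if (res == HINT_DENIED)
		fprintf(stderr, "nice %d denied (%s), keeping defaults\n",
			nice_value, strerror(err));
	return res;
}
```

An application written this way keeps running with its default priority when
the request is refused, which is the behavior I would expect from anything
that already copes with SELinux denials.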