Re: Announcing Sched QoS v0.1-alpha

From: Christian Loehle

Date: Wed Apr 15 2026 - 04:59:30 EST


On 4/15/26 01:09, Qais Yousef wrote:
> Hi everyone
>
> This is the first announcement of Sched QoS 0.1-alpha release. This is still at
> PoC stage and NOT production ready.
>
> https://github.com/qais-yousef/schedqos
>
> This is a follow-up to the LPC 2025 discussion about Userspace Assisted Scheduling via Sched QoS
>
> https://lpc.events/event/19/contributions/2089/
>
>
> Background and Concepts
> =======================
>
> The world has changed a _little_ bit in the past 30 years..
>
> Modern systems have sophisticated hardware that comes in all shapes and colors.
> How software is written to interact with modern hardware hasn't changed
> though. The kernel had to keep up with the hardware, but userspace didn't.
> POSIX is ancient and didn't evolve to help OS and application writers deal
> with these changes.
>
> A major problem faced by many workloads is how to manage resources better so
> that important workloads can get the latency or throughput they need, while
> still co-existing alongside each other, and while remaining oblivious of the
> hardware they are running on or of kernel (scheduler) implementation details.
>
> Many discussions in this area have focused on what the kernel interface should
> look like, but I think this is the wrong approach. The same way we have libc,
> pthread etc to define interfaces for common operations that are detached from
> the OS, we need to detach how workloads are described from OS specifics. And
> by implication, from the hardware too.
>
> The Sched QoS utility aims to do that. And it tries to take the unique
> approach of being Zero-API based.
>
> One of the biggest challenges to adding any new API is the adoption time. It
> can easily span 14-18 months if not more. Rolling things out for users to see
> benefits has a substantial delay, and iterating and reacting to problems takes
> a similarly long time. All in all, the point of maturity can end up so far in
> the future that adoption may never materialize. And it raises the burden, and
> the bar for getting things right the first time, so high that progress can
> stall at the discussion stage.
>
> Another major challenge is trusting applications to provide the right hints.
> Managing potential abusers can be difficult (doable, but, back to the previous
> point, it can stall the discussion on agreeing on the 'right' approach). There
> are also potential ABI implications that can make easy evolution and fast
> iteration hard.
>
> With config-based hinting we eliminate these problems. Admins (which on many
> systems means the user, because it's Linux..) will have to choose and approve
> the config to be applied for any app. It also provides additional flexibility
> where a workload is 'important' on one system but is actually 'background' on
> another. Group control (discussed at LPC) is a better way to manage this, but
> generally this approach inherently better handles variations in perf/power
> trade-offs when tagging can be done differently to suit different needs. e.g.
> you can have more than one way to tag an application to suit different needs
> rather than being stuck with a single one imposed on you.
>
> By using NETLINK to listen for when new tasks are forked and processes are
> created, we can easily create a config-based system to auto-tag tasks based
> on a userspace description.
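For reference, those fork/exec notifications would presumably come from the proc connector over NETLINK_CONNECTOR. A minimal sketch of building the subscription message (assuming CONFIG_PROC_EVENTS; constants taken from linux/netlink.h, linux/connector.h and linux/cn_proc.h, not necessarily how schedqos itself does it):

```python
import struct

# Constants from the kernel uapi headers.
NLMSG_DONE = 0x3            # netlink message type used by the connector
CN_IDX_PROC = 0x1           # connector id for process events
CN_VAL_PROC = 0x1
PROC_CN_MCAST_LISTEN = 0x1  # op: start multicast delivery of proc events

def build_subscribe_msg(pid: int) -> bytes:
    """Build nlmsghdr + cn_msg + op requesting process event multicast."""
    op = struct.pack("=I", PROC_CN_MCAST_LISTEN)
    # struct cn_msg: cb_id {idx, val}, seq, ack, len, flags
    cn_msg = struct.pack("=IIIIHH", CN_IDX_PROC, CN_VAL_PROC,
                         0, 0, len(op), 0) + op
    # struct nlmsghdr: len, type, flags, seq, pid
    nl_len = 16 + len(cn_msg)
    nlmsghdr = struct.pack("=IHHII", nl_len, NLMSG_DONE, 0, 0, pid)
    return nlmsghdr + cn_msg

# 16 (nlmsghdr) + 20 (cn_msg) + 4 (op) = 40 bytes, to be sent over a bound
# socket(AF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR) - needs CAP_NET_ADMIN.
msg = build_subscribe_msg(pid=1234)
```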
>
> For example the following config
>
> {
>     "chrome": {
>         "qos": "QOS_USER_INTERACTIVE"
>     },
>     "gcc": {
>         "qos": "QOS_BACKGROUND"
>     }
> }
>
> is a poor man's way to tell the system the user wants to treat Chrome as
> a user_interactive application and gcc (compiler) as a background one.
>
> This is a provisional way for 'easy' tagging. The real intended use is for
> specific tasks within an application (process) to be tagged to describe their
> individual roles:
>
> {
>     "chrome": {
>         "thread_qos": {
>             "chrome": "QOS_USER_INTERACTIVE",
>             "some_bg_task": "QOS_BACKGROUND"
>         }
>     }
> }
>
> so that real interactive tasks are given the resources they need, and the
> noise is reduced to just that. Note that by default everything will be treated
> as QOS_DEFAULT, which is set to match QOS_UTILITY. IOW, everything is assumed
> to be random 'noise' by default, which makes delivering better best-effort QoS
> easier.
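A sketch of how such a per-thread config could be resolved, with the QOS_DEFAULT fallback described above (the key names mirror the examples; the real tool's schema and matching rules may differ):

```python
import json

# Hypothetical config combining the per-app and per-thread forms above.
CONFIG = json.loads("""
{
    "chrome": {
        "thread_qos": {
            "chrome": "QOS_USER_INTERACTIVE",
            "some_bg_task": "QOS_BACKGROUND"
        }
    },
    "gcc": {
        "qos": "QOS_BACKGROUND"
    }
}
""")

def resolve_qos(process_comm: str, thread_comm: str) -> str:
    """Map (process name, thread name) to a QoS class; QOS_DEFAULT otherwise."""
    app = CONFIG.get(process_comm)
    if app is None:
        return "QOS_DEFAULT"               # untagged: treated as noise
    per_thread = app.get("thread_qos", {})
    if thread_comm in per_thread:
        return per_thread[thread_comm]     # specific role for this thread
    return app.get("qos", "QOS_DEFAULT")   # whole-app fallback
```

For example, a chrome thread not listed under "thread_qos" falls back to QOS_DEFAULT, since the per-thread config deliberately tags only the known roles.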
>
> Roles
> =====
>
> This model is based on an existing one shipped in the industry [1] whose
> users are happy with it. It breaks down a task's role into four classes, plus
> a default:
>
> * USER_INTERACTIVE: Requires immediate response
> * USER_INITIATED: Tolerates short latencies, but must still get work done quickly
> * UTILITY: Tolerates long delays, but not prolonged ones
> * BACKGROUND: Doesn't mind prolonged delays
> * DEFAULT: All untagged tasks get this category, which maps to UTILITY
>
> EEVDF should allow us to describe these different levels by specifying a
> different runtime (custom slice) for each class. The shortest slice should
> still be long enough not to sacrifice throughput. Nice values operate as
> bandwidth control so that long-running user_interactive tasks can't be starved
> by long-running background ones if they have to run on the same CPU under
> overloaded scenarios. uclamp_max helps constrain the power impact and access
> to the expensive highest performance levels.
>
> Mapping
> -------
>
> {
>     "QOS_USER_INTERACTIVE": {
>         "sched_policy": "SCHED_NORMAL",
>         "sched_nice": -4,
>         "sched_runtime": 8000000,
>         "sched_util_max": 1024
>     },
>     "QOS_USER_INITIATED": {
>         "sched_policy": "SCHED_NORMAL",
>         "sched_nice": -2,
>         "sched_runtime": 12000000,
>         "sched_util_max": 768
>     },
>     "QOS_UTILITY": {
>         "sched_policy": "SCHED_BATCH",
>         "sched_nice": 2,
>         "sched_runtime": 16000000,
>         "sched_util_max": 512
>     },
>     "QOS_BACKGROUND": {
>         "sched_policy": "SCHED_BATCH",
>         "sched_nice": 4,
>         "sched_runtime": 20000000,
>         "sched_util_max": 256
>     },
>     "QOS_DEFAULT": {
>         "sched_policy": "SCHED_BATCH",
>         "sched_nice": 2,
>         "sched_runtime": 16000000,
>         "sched_util_max": 512
>     }
> }
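For context, these fields line up with struct sched_attr as consumed by the sched_setattr() syscall. A sketch of packing one mapping entry into that struct (SCHED_ATTR_SIZE_VER1 layout, constants from include/uapi/linux/sched.h; how schedqos actually applies the mapping may differ):

```python
import struct

# Constants from include/uapi/linux/sched.h.
SCHED_NORMAL = 0
SCHED_BATCH = 3
SCHED_FLAG_UTIL_CLAMP_MAX = 0x40
ATTR_SIZE = 56  # SCHED_ATTR_SIZE_VER1

POLICIES = {"SCHED_NORMAL": SCHED_NORMAL, "SCHED_BATCH": SCHED_BATCH}

def pack_sched_attr(entry: dict) -> bytes:
    """Pack a QoS mapping entry into struct sched_attr bytes."""
    return struct.pack(
        "=IIQiIQQQII",
        ATTR_SIZE,                        # u32 size
        POLICIES[entry["sched_policy"]],  # u32 sched_policy
        SCHED_FLAG_UTIL_CLAMP_MAX,        # u64 sched_flags
        entry["sched_nice"],              # s32 sched_nice
        0,                                # u32 sched_priority (RT only)
        entry["sched_runtime"],           # u64 sched_runtime (custom slice, ns)
        0, 0,                             # u64 deadline, period (deadline only)
        0,                                # u32 sched_util_min (unused here)
        entry["sched_util_max"],          # u32 sched_util_max
    )

attr = pack_sched_attr({
    "sched_policy": "SCHED_BATCH",
    "sched_nice": 4,
    "sched_runtime": 20000000,
    "sched_util_max": 256,
})
# The bytes would then go to syscall(SYS_sched_setattr, tid, attr, 0);
# negative nice values additionally need CAP_SYS_NICE.
```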
>
Is my understanding correct that this is device-agnostic?
In particular, sched_util_max seems very platform-dependent.
Also, these could very well all land on the same big cluster and then be
effectively void.
And I don't think I generally buy the argument that uclamp_max is even a good
generic way to save power, not with these fixed (and arbitrary) values (e.g.
256 might be just at the threshold that allows/requires very inefficient OPPs,
which happen frequently, in particular given the instability of utilization
values under sched_util_max / restricted compute capacity).