Announcing Sched QoS v0.1-alpha

From: Qais Yousef

Date: Tue Apr 14 2026 - 20:10:18 EST


Hi everyone

This is the first announcement of the Sched QoS v0.1-alpha release. This is
still at PoC stage and NOT production ready.

https://github.com/qais-yousef/schedqos

This is a follow-up to the LPC 2025 discussion about Userspace Assisted Scheduling via Sched QoS

https://lpc.events/event/19/contributions/2089/


Background and Concepts
=======================

The world has changed a _little_ bit in the past 30 years..

Modern systems have sophisticated hardware that comes in all shapes and colors.
How software is written to interact with modern hardware hasn't changed,
though. The kernel had to keep up with the hardware, but userspace didn't.
POSIX is ancient and hasn't evolved to help OS and application writers deal
with these changes.

A major problem faced by many workloads is how to manage resources better so
that important workloads can get the latency or throughput they need, while
still co-existing alongside each other, and while remaining oblivious to the
hardware they are running on or to kernel (scheduler) implementation details.

Many discussions in this area have focused on what the kernel interface should
look like, but I think this is the wrong approach. The same way we have libc,
pthread etc to define interfaces for common operations that are detached from
the OS, we need to detach how workloads are described from OS specifics. And,
by implication, from hardware too.

The Sched QoS utility aims to do that. And it tries to take the unique
approach of being Zero-API based.

One of the biggest challenges to adding any new API is adoption time. It can
easily span 14-18 months if not more. Rolling things out for users to see
benefits has a substantial delay, and iterating and reacting to problems takes
similarly long. All in all, reaching a maturity point can end up so far in the
future that adoption never materializes. It also raises the burden, and the bar
for getting things right the first time, so high that progress can stall at
the discussion stage.

Another major challenge is trusting applications to provide the right hints.
Managing potential abusers can be difficult (doable, but as per the previous
point, it can stall the discussion on agreeing to the 'right' approach). There
are also the potential ABI implications that make easy evolution and fast
iteration hard.

With config based hinting we eliminate these problems. Admins (which on many
systems means the user, because it's Linux..) have to choose and approve the
config to be applied for any app. It also provides additional flexibility where
a workload is 'important' on one system but is actually 'background' on
another. Group control (discussed at LPC) is a better way to manage this, but
config based tagging inherently better handles perf/power trade-off
variations, since tagging can be done differently to suit different needs.
e.g. you can have more than one way to tag an application rather than being
stuck with a single one imposed on you.

By using NETLINK to listen for when new tasks are forked and processes are
created, we can easily build a config based system to auto-tag tasks based on
a userspace description.
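As an illustrative sketch only (not necessarily how the schedqos utility
implements it), parsing a PROC_EVENT_FORK payload from such a NETLINK
proc-connector socket could look like this in Python, with the struct layouts
taken from linux/netlink.h, linux/connector.h and linux/cn_proc.h:

```python
import struct

# Header layouts from <linux/netlink.h> and <linux/connector.h>
NLMSG_HDR = struct.Struct("=IHHII")   # len, type, flags, seq, pid
CN_MSG = struct.Struct("=IIIIHH")     # cb_id.idx, cb_id.val, seq, ack, len, flags
PROC_EVENT_FORK = 0x00000001          # from <linux/cn_proc.h>

def parse_fork_event(buf):
    """Return (parent_tgid, child_tgid) for a fork event, else None."""
    off = NLMSG_HDR.size + CN_MSG.size
    # struct proc_event header: what (u32), cpu (u32), timestamp_ns (u64)
    what, _cpu, _ts = struct.unpack_from("=IIQ", buf, off)
    if what != PROC_EVENT_FORK:
        return None
    # fork event data: parent_pid, parent_tgid, child_pid, child_tgid
    _ppid, ptgid, _cpid, ctgid = struct.unpack_from("=4I", buf, off + 16)
    return ptgid, ctgid
```

Subscribing to these events (PROC_CN_MCAST_LISTEN on a NETLINK_CONNECTOR
socket) requires CAP_NET_ADMIN; the sketch above only covers the parsing side.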

For example the following config

{
    "chrome": {
        "qos": "QOS_USER_INTERACTIVE"
    },
    "gcc": {
        "qos": "QOS_BACKGROUND"
    }
}

is a poor man's way to tell the system the user wants to treat Chrome as
a user_interactive application and gcc (compiler) as a background one.

This is a provisional way for 'easy' tagging. The real intended use is for
specific tasks within an application (process) to be tagged to describe their
individual role

{
    "chrome": {
        "thread_qos": {
            "chrome": "QOS_USER_INTERACTIVE",
            "some_bg_task": "QOS_BACKGROUND"
        }
    }
}

so that real interactive tasks are given the resources they need, and the
noise is reduced to just that. Note that by default everything is treated as
QOS_DEFAULT, which is set to match QOS_UTILITY. IOW, everything is assumed to
be random 'noise' by default, which makes delivering better best-effort QoS
easier.
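A sketch of how such a config could be resolved for a task (the function name
and structure here are illustrative, not the utility's actual code):

```python
QOS_DEFAULT = "QOS_DEFAULT"

def resolve_qos(config, process_comm, thread_comm=None):
    """Look up the QoS class for a task given its process and thread names."""
    entry = config.get(process_comm)
    if entry is None:
        return QOS_DEFAULT  # untagged: treated as random 'noise' by default
    thread_qos = entry.get("thread_qos")
    if thread_qos is not None and thread_comm is not None:
        # Per-thread tagging takes precedence; untagged threads fall back
        # to the process-wide qos, or to the default.
        return thread_qos.get(thread_comm, entry.get("qos", QOS_DEFAULT))
    return entry.get("qos", QOS_DEFAULT)
```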

Roles
=====

This model is based on an existing one shipped in the industry [1] whose users
are happy with it. It breaks down a task's role into 4 classes (plus
a default):

* USER_INTERACTIVE: Requires immediate response
* USER_INITIATED: Tolerates short latencies, but must still get the work done quickly
* UTILITY: Tolerates long delays, but not prolonged ones
* BACKGROUND: Doesn't mind prolonged delays
* DEFAULT: All untagged tasks get this category, which maps to UTILITY

EEVDF should allow us to express these different levels by assigning
a different runtime (custom slice) to each class. The shortest slice should
still be long enough not to sacrifice throughput. Nice values operate as
bandwidth control so that long running user_interactive tasks can't be starved
by long running background ones if they have to run on the same CPU under
overloaded scenarios. uclamp_max helps constrain the power impact and access
to the expensive highest performance levels.

Mapping
-------

{
    "QOS_USER_INTERACTIVE": {
        "sched_policy": "SCHED_NORMAL",
        "sched_nice": -4,
        "sched_runtime": 8000000,
        "sched_util_max": 1024
    },
    "QOS_USER_INITIATED": {
        "sched_policy": "SCHED_NORMAL",
        "sched_nice": -2,
        "sched_runtime": 12000000,
        "sched_util_max": 768
    },
    "QOS_UTILITY": {
        "sched_policy": "SCHED_BATCH",
        "sched_nice": 2,
        "sched_runtime": 16000000,
        "sched_util_max": 512
    },
    "QOS_BACKGROUND": {
        "sched_policy": "SCHED_BATCH",
        "sched_nice": 4,
        "sched_runtime": 20000000,
        "sched_util_max": 256
    },
    "QOS_DEFAULT": {
        "sched_policy": "SCHED_BATCH",
        "sched_nice": 2,
        "sched_runtime": 16000000,
        "sched_util_max": 512
    }
}
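Each mapping entry corresponds to fields of struct sched_attr as applied via
sched_setattr(2). A minimal sketch of packing that struct (field layout per
linux/sched/types.h; the flag choice and helper name are assumptions for
illustration, not the utility's actual code):

```python
import struct

# Policies from <linux/sched.h>
SCHED_NORMAL, SCHED_BATCH = 0, 3
# From <linux/sched/types.h>: needed so the kernel honors sched_util_max
SCHED_FLAG_UTIL_CLAMP_MAX = 0x40

# struct sched_attr: size, policy (u32); flags (u64); nice (s32),
# priority (u32); runtime, deadline, period (u64); util_min, util_max (u32)
SCHED_ATTR = struct.Struct("=IIQiIQQQII")

def build_sched_attr(entry):
    """Pack one QoS mapping entry into a struct sched_attr buffer."""
    policy = {"SCHED_NORMAL": SCHED_NORMAL,
              "SCHED_BATCH": SCHED_BATCH}[entry["sched_policy"]]
    return SCHED_ATTR.pack(
        SCHED_ATTR.size,            # size of the struct the caller passes
        policy,
        SCHED_FLAG_UTIL_CLAMP_MAX,
        entry["sched_nice"],
        0,                          # sched_priority (RT policies only)
        entry["sched_runtime"],     # custom slice, in nanoseconds
        0, 0,                       # deadline/period (SCHED_DEADLINE only)
        0,                          # util_min left at 0 here
        entry["sched_util_max"],
    )
```

One would then invoke the sched_setattr syscall (314 on x86-64) with this
buffer for each tagged tid; which additional SCHED_FLAG_* bits are appropriate
depends on the kernel version.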


Caveats
=======

AUTOGROUP and cgroup cpu controller must be disabled for maximum effectiveness.
We assume a flat hierarchy and per-task description to keep the system under
control.

If they are enabled, it is hard to distinguish between background and user
interactive tasks across processes due to imposed fairness at group level.
Most notably under loaded scenarios with a lot of long running tasks.

It is believed that a flat hierarchy is the best approach, and that per-task
tagging combined with simple group control ensures the roles of processes and
tasks are described simply, yet sufficiently, to get the desired behavior with
the least complexity and maximum portability/flexibility.


Next steps
==========

The current setup is usable and should provide tangible results for those
interested. Corner cases where it fails will be visible under comprehensive
testing, though. See the schbench+kernel build results below for example.

I won't repeat our LPC discussion, but we need a multi-modal wake up path and
coherent decision making between the wake up path and the load balancer. Both
items are already being worked on. Push based load balance for fair is on the
list [2].

For best perf/watt under schedutil, we need to introduce the concept of
a rampup multiplier to help counter DVFS related latencies. This is also WIP;
there were patches in the past [3] that will be rebased and reposted.

Idle states can be a problem for both performance and power, and the scheduler
today doesn't take them into account at all.

Performance Inversion and Inheritance issues are common in practice and
require Proxy Execution, plus teaching libc and languages to move to futex_pi
by default. In practice folks will see latency/perf issues at least in their
P99 and max latencies - the severity will depend on the use case. See the
discussion on enabling PI by default in userspace [4].

We also need to add a new unfair PI lock that doesn't operate in strict order,
which is critical for performance [5]. I have gathered some results privately
showing that using futex_pi causes performance regressions.

The utility itself still needs to handle group control. We need to extend
NETLINK to send events when tasks move groups, and introduce a group level
control of which QoS is allowed or not. The goal is to piggy-back on cgroup
but provide userspace annotation of QoS. This is a simple allow/disallow, e.g.
to prevent user_interactive QoS when a task is in a 'background' group. Think
of an app that is minimized or a browser tab that is hidden. This group
control might need integration with window managers so that it is
transparently handled for all apps.

There are also problems not seen outside of the Linux ecosystem that might
require an extra dimension to annotate tasks that can be memory sensitive
(particularly NUMA). I call this new dimension Execution Profile. But this is
an area that requires further discussion at one of the upcoming conferences
and a dedicated thread on the list. Cache Aware Scheduling in particular could
be simplified if such an annotation were provided by userspace - which this
approach should make easy to implement and start experimenting with.

We also need to ensure sched_attr is locked and can only be modified by the
schedqos utility, so that it can be the sole orchestrator managing the
behavior of the tasks.

The project will hopefully move to kernel.org and take contributions as part
of the usual kernel/scheduler development process.


Results
=======

Based on tip/sched/core: 8d16e3c6f844

Cyclictest + hackbench
----------------------

hackbench -T -p -l 60000 -g 2 &
sleep 1
sudo nice -n 0 cyclictest -t 1 -i 1777 -D 30 -h 20000 -q

{
    "hackbench": {
        "qos": "QOS_BACKGROUND"
    },
    "cyclictest": {
        "qos": "QOS_USER_INTERACTIVE"
    }
}

Default:

# Min Latencies: 00004
# Avg Latencies: 00062
# Max Latencies: 04426

With schedqos:

# Min Latencies: 00003
# Avg Latencies: 00053
# Max Latencies: 01246

schbench + kernel build
-----------------------

AUTOGROUP and the CPU controller were disabled, otherwise you won't see
a difference due to imposed fairness at group level.

{
    "make": {
        "qos": "QOS_BACKGROUND"
    },
    "gcc": {
        "qos": "QOS_BACKGROUND"
    },
    "cc1": {
        "qos": "QOS_BACKGROUND"
    },
    "schbench": {
        "thread_qos": {
            "schbench-msg": "QOS_USER_INTERACTIVE",
            "schbench-worker": "QOS_USER_INITIATED"
        }
    }
}

Default:

Wakeup Latencies percentiles (usec) runtime 30 (s) (45018 total samples)
          50.0th: 1618 (11870 samples)
          90.0th: 3580 (18081 samples)
        * 99.0th: 4952 (3973 samples)
          99.9th: 6760 (405 samples)
          min=1, max=11092
Request Latencies percentiles (usec) runtime 30 (s) (45042 total samples)
          50.0th: 12464 (13518 samples)
          90.0th: 22496 (18009 samples)
        * 99.0th: 36032 (4020 samples)
          99.9th: 75904 (405 samples)
          min=3860, max=144284
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 1458 (7 samples)
        * 50.0th: 1486 (9 samples)
          90.0th: 1566 (12 samples)
          min=1420, max=1603
average rps: 1501.40

With schedqos:

Wakeup Latencies percentiles (usec) runtime 30 (s) (67556 total samples)
          50.0th: 10 (15337 samples)
          90.0th: 6488 (26386 samples)
        * 99.0th: 13168 (5961 samples)
          99.9th: 19232 (607 samples)
          min=1, max=32126
Request Latencies percentiles (usec) runtime 30 (s) (67618 total samples)
          50.0th: 6568 (21537 samples)
          90.0th: 13744 (25740 samples)
        * 99.0th: 37312 (6064 samples)
          99.9th: 65472 (602 samples)
          min=3506, max=153046
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 2084 (7 samples)
        * 50.0th: 2268 (9 samples)
          90.0th: 2412 (12 samples)
          min=1904, max=2523
average rps: 2253.93

Notes
-----

Throughput is better, but latencies get worse due to a number of bugs Vincent
is addressing and patches for which are already on the list [6][7][8].

One observation is that disabling RUN_TO_PARITY yields better latencies. But
hopefully this won't be necessary.

Also, schbench can suffer from bad task placement where two worker threads end
up on the same CPU. Once the multi-modal wake up path is ready,
USER_INTERACTIVE tasks should be spread across CPUs. The current wake up
behavior spreads based on load only, which means we can end up with these
accidental bad placements; the scheduler needs to understand that for better
latencies it is best not to place two tasks with short deadlines on the same
CPU.


With schedqos + NO_RUN_TO_PARITY + [6][7][8] patches:

Wakeup Latencies percentiles (usec) runtime 30 (s) (69260 total samples)
          50.0th: 9 (26985 samples)
          90.0th: 1342 (19692 samples)
        * 99.0th: 1686 (6280 samples)
          99.9th: 2428 (570 samples)
          min=1, max=5710
Request Latencies percentiles (usec) runtime 30 (s) (69338 total samples)
          50.0th: 8104 (20785 samples)
          90.0th: 17696 (27798 samples)
        * 99.0th: 23648 (6173 samples)
          99.9th: 35904 (623 samples)
          min=3835, max=90821
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 2228 (7 samples)
        * 50.0th: 2300 (9 samples)
          90.0th: 2404 (12 samples)
          min=2171, max=2556
average rps: 2311.27


Contribute
==========

Help us make it better! See the list of areas we need help with in
CONTRIBUTE.md [9].

Or just give it a go and report any corner cases that don't work, so we can
look at them and make sure we have a plan to get them fixed :)


[1] https://developer.apple.com/library/archive/documentation/Performance/Conceptual/EnergyGuide-iOS/PrioritizeWorkWithQoS.html#//apple_ref/doc/uid/TP40015243-CH39-SW1
[2] https://lore.kernel.org/lkml/20251202181242.1536213-1-vincent.guittot@xxxxxxxxxx/
[3] https://lore.kernel.org/lkml/20240820163512.1096301-1-qyousef@xxxxxxxxxxx/
[4] https://lpc.events/event/19/contributions/2244/
[5] https://developer.apple.com/documentation/os/os_unfair_lock_lock
[6] https://lore.kernel.org/lkml/20260331162352.551501-1-vincent.guittot@xxxxxxxxxx/
[7] https://lore.kernel.org/lkml/20260410132321.2897789-1-vincent.guittot@xxxxxxxxxx/
[8] https://lore.kernel.org/lkml/20260410144808.2943278-1-vincent.guittot@xxxxxxxxxx/
[9] https://github.com/qais-yousef/schedqos/blob/main/CONTRIBUTE.md