Re: Announcing Sched QoS v0.1-alpha

From: Barry Song

Date: Fri Apr 17 2026 - 07:27:15 EST


On Wed, Apr 15, 2026 at 8:10 AM Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> Hi everyone
>
> This is the first announcement of Sched QoS 0.1-alpha release. This is still at
> PoC stage and NOT production ready.
>
> https://github.com/qais-yousef/schedqos

Thanks for releasing this code—I’ve been looking forward to
it for quite some time.

I tried running it on arm64, but unfortunately it crashed at schedqos.c:144.

Program received signal SIGSEGV, Segmentation fault.
Download failed: Invalid argument. Continuing without source file
./string/../sysdeps/aarch64/strcmp.S.
__GI_strcmp () at ../sysdeps/aarch64/strcmp.S:78
warning: 78 ../sysdeps/aarch64/strcmp.S: No such file or directory
(gdb) bt
#0 __GI_strcmp () at ../sysdeps/aarch64/strcmp.S:78
#1 0x0000aaaaaaaa2458 in main (argc=<optimized out>, argv=<optimized out>) at schedqos.c:144

Then it seems that running "./schedqos start" is fine.

>
> This is a follow-up to the LPC 2025 discussion about Userspace Assisted Scheduling via Sched QoS
>
> https://lpc.events/event/19/contributions/2089/
>
>
> Background and Concepts
> =======================
>
> The world has changed a _little_ bit in the past 30 years..
>
> Modern systems have sophisticated hardware that comes in all shapes and colors.
> How software is written to interact with modern hardware hasn't changed,
> though. The kernel had to keep up with hardware, but userspace didn't. POSIX
> is ancient and didn't evolve to help OS and application writers deal with
> these changes.
>
> A major problem faced by many workloads is how to manage resources better so
> that important workloads can get the latency or throughput they need, while
> still co-existing alongside each other, and while being oblivious to the
> hardware they are running on or to kernel (scheduler) implementation details.
>
> Many discussions in this area have focused on what the kernel interface should
> look like, but I think this is the wrong approach. The same way we have libc,
> pthread etc to define interfaces for common operations that are detached from
> the OS, we need to detach how workloads are described from OS specifics. And
> of course from hardware by implication.
>
> The Sched QoS utility aims to do that. And it tries to take the unique
> Zero-API based approach.
>
> One of the biggest challenges to adding any new API is the adoption time. It
> can easily span 18 months if not more. Rolling things out for users to see
> benefits has a substantial delay, and iterating and reacting to problems takes
> similarly long. All in all, the maturity point can end up so far in the future
> that adoption never materializes. And it makes the burden and the bar for
> getting things right the first time so high that progress can stall at the
> discussion stage.
>
> Another major challenge is trusting applications to provide the right hints.
> Managing potential abusers can be difficult (doable, but back to the previous
> point, it can lead to stifling discussion to agree on the 'right' approach).
> There are also potential ABI implications that can make easy evolution and
> fast iteration hard.
>
> With config based hinting we eliminate these. Admins (which on many systems
> means the user, because it's Linux..) will have to choose and approve the
> config to be applied for any app. It also provides additional flexibility
> where a workload is 'important' on one system, but is actually 'background' on
> another. Group control (discussed at LPC) is a better way to manage this, but
> generally it inherently better handles potential perf/power trade-off
> variations if tagging can be done differently to suit different needs. e.g.
> you can have more than one way to tag an application to suit potentially
> different needs rather than being stuck with a single one imposed on you.
>
> By using NETLINK to listen for when new tasks are forked and processes are
> created, we can easily create a config based system to auto tag tasks based on
> userspace description.

We have a long history of relying on netlink for the device driver model
and udev. For threads, I guess the events would be much more frequent
than device hotplug/unplug—one app might create hundreds of threads.
We may end up relying on some ring-buffer-based event mechanism if we
eventually find that netlink itself becomes a bottleneck :-)

Now I see you are monitoring both PROC_EVENT_FORK and
PROC_EVENT_COMM. Wouldn’t this sometimes be duplicated, since
PR_SET_NAME may come soon after a thread is created?

>
> For example the following config
>
> {
>   "chrome": {
>     "qos": "QOS_USER_INTERACTIVE"
>   },
>   "gcc": {
>     "qos": "QOS_BACKGROUND"
>   }
> }
>
> is a poor man's way to tell the system the user wants to treat Chrome as
> a user_interactive application and gcc (the compiler) as a background one.
>
> This is a provisional way for 'easy' tagging. The real intended use is for
> specific tasks within an application (process) to be tagged to describe their
> individual roles:
>
> {
>   "chrome": {
>     "thread_qos": {
>       "chrome": "QOS_USER_INTERACTIVE",
>       "some_bg_task": "QOS_BACKGROUND"
>     }
>   }
> }
>

I assume this depends on user code setting the thread name via
prctl(PR_SET_NAME, "name"), based on the code below?

static void iterate_threads(pid_t tgid)
{
char task_path[256];
snprintf(task_path, sizeof(task_path), "/proc/%d/task", tgid);

DIR *tdir = opendir(task_path);
struct dirent *tentry;

while (tdir && (tentry = readdir(tdir)) != NULL) {
if (is_numeric(tentry->d_name)) {
pid_t pid = atoi(tentry->d_name);
char comm[TASK_COMM_LEN];

if (get_comm_by_pid(pid, comm))
apply_thread_qos(pid, tgid, comm);
}
}
if (tdir) closedir(tdir);
}

I assume this is quite common on Android, but in Linux
distributions, almost no applications set names for their
threads?

So, the project is trying to encourage people to set proper
names for their threads, right?

> so that real interactive tasks are given the resources they need, and the
> noise is reduced to just that. Note that by default everything will be treated
> as QOS_DEFAULT, which is set to match QOS_UTILITY. IOW everything is assumed
> to be random 'noise' by default, which makes delivering better best-effort QoS
> easier.
>
> Roles
> =====
>
> This model is based on an existing one shipped in the industry [1] whose users
> are happy with it. It breaks down the tasks' roles into four classes, plus
> a default:
>
> * USER_INTERACTIVE: Requires immediate response
> * USER_INITIATED: Tolerates short latencies, but must get work done quickly still
> * UTILITY: Tolerates long delays, but not prolonged ones
> * BACKGROUND: Doesn't mind prolonged delays
> * DEFAULT: All untagged tasks will get this category which will map to utility.
>
> EEVDF should allow us to describe these different levels by specifying
> a different runtime (custom slice) for each class. The shortest slice should
> still be long enough not to sacrifice throughput. Nice values will operate as
> bandwidth control so that long running user_interactive tasks can't be starved
> by long running background ones if they have to run on the same CPU under
> overloaded scenarios. uclamp_max helps constrain the power impact and access
> to the expensive highest performance levels.
>
> Mapping
> -------
>
> {
>   "QOS_USER_INTERACTIVE": {
>     "sched_policy": "SCHED_NORMAL",
>     "sched_nice": -4,
>     "sched_runtime": 8000000,
>     "sched_util_max": 1024
>   },
>   "QOS_USER_INITIATED": {
>     "sched_policy": "SCHED_NORMAL",
>     "sched_nice": -2,
>     "sched_runtime": 12000000,
>     "sched_util_max": 768
>   },
>   "QOS_UTILITY": {
>     "sched_policy": "SCHED_BATCH",
>     "sched_nice": 2,
>     "sched_runtime": 16000000,
>     "sched_util_max": 512
>   },
>   "QOS_BACKGROUND": {
>     "sched_policy": "SCHED_BATCH",
>     "sched_nice": 4,
>     "sched_runtime": 20000000,
>     "sched_util_max": 256
>   },
>   "QOS_DEFAULT": {
>     "sched_policy": "SCHED_BATCH",
>     "sched_nice": 2,
>     "sched_runtime": 16000000,
>     "sched_util_max": 512
>   }
> }
>

I now see a fairly simple call stack like
apply_thread_qos_tag() → sched_setattr(), which seems quite readable to me.

>
> Caveats
> =======
>
> AUTOGROUP and cgroup cpu controller must be disabled for maximum effectiveness.
> We assume a flat hierarchy and per-task description to keep the system under
> control.
>
> If they are enabled, it is hard to distinguish between background and user
> interactive tasks across processes due to the fairness imposed at group level,
> most notably under loaded scenarios with a lot of long running tasks.
>
> It is believed that a flat hierarchy is the best approach, and that per-task
> tagging combined with simple group control ensures the roles of processes and
> tasks are described simply, yet sufficiently, to get the desired behavior with
> the least complexity and maximum portability/flexibility.
>

I don’t see how this is related to your schedqos, since you’re just calling
sched_setattr(). That is fully consistent with the Linux kernel model.

Am I right in assuming that whatever you set via sched_setattr() will always
take effect even in a cgroup-based system?

>
> Next steps
> ==========
>
> The current setup is usable and should provide tangible results for those
> interested. Corner cases where it fails will become visible under
> comprehensive testing, though. See the schbench+kernel build results below
> for an example.
>
> I won't repeat our LPC discussion, but we need a multi-modal wake up path and
> coherent decision making between wake up and the load balancer. Both items are
> already being worked on. Push based load balance for fair is on the list [2].
>
> For best perf/watt under schedutil, we need to introduce the concept of a
> rampup multiplier to help counter DVFS related latencies. This is also WIP and
> there were patches in the past [3] that will be rebased and reposted.
>
> Idle states can be a problem for performance and power, and the scheduler
> today doesn't take them into account at all.
>
> Performance Inversion and Inheritance issues are common in practice and
> require Proxy Execution and teaching libc and languages to move to futex_pi
> by default. In practice folks will see latency/perf issues in their P99 and
> max at least; the severity will depend on the use case. See the discussion on
> Enable PI by default in userspace [4].
>
> We also need to add a new unfair pi lock that doesn't operate in strict
> order, which is critical for performance [5]. I have gathered some results
> privately showing that using futex_pi causes performance regressions.

I can imagine that if we have too much priority inheritance, we could
end up hurting the system. Priority inheritance raises the priority of
tasks that should not necessarily be in the high-priority class. This is
especially problematic when the dependency chain is long and complex.

>
> The utility itself still needs to handle group control. We need to extend
> NETLINK to send events when tasks move groups and introduce a group level
> control of what QoS is allowed or not. The goal is to piggy-back on cgroup
> but provide userspace annotation of QoS - a simple allow/disallow to prevent
> user_interactive tasks, for instance, when a task is in a 'background' group.
> Think of an app that is minimized or a browser tab that is hidden. This group
> control might need integration with window managers so that it is handled
> transparently for all apps.
>
> There are also problems not seen outside of the Linux ecosystem that might
> require an extra dimension to annotate tasks that are memory sensitive
> (particularly NUMA). I call this new dimension Execution Profile. But this is
> an area that requires further discussion in one of the upcoming conferences
> and a dedicated thread on the list. Cache Aware Scheduling in particular could
> be simplified if such annotation were provided by userspace - which this
> approach should make easy to implement and start experimenting with.
>
> We also need to ensure sched_attr is locked and can only be modified by
> schedqos utility so that it can be the sole orchestrator for managing the
> behavior of the tasks.
>
> The project would hopefully move to kernel.org and get contributions as part
> of the usual kernel/scheduler development process.
>
>
> Results
> =======
>
> Based on tip/sched/core: 8d16e3c6f844
>
> Cyclictest + hackbench
> ----------------------
>
> hackbench -T -p -l 60000 -g 2 &
> sleep 1
> sudo nice -n 0 cyclictest -t 1 -i 1777 -D 30 -h 20000 -q
>
> {
>   "hackbench": {
>     "qos": "QOS_BACKGROUND"
>   },
>   "cyclictest": {
>     "qos": "QOS_USER_INTERACTIVE"
>   }
> }
>

This raises a question: do you want app developers to write the
configuration file for their own apps, or do you want system
administrators to define a global configuration file?

Why is hackbench considered background and cyclictest considered
USER_INTERACTIVE? Who is supposed to know this?

App developers may not do this, and system administrators may lack
knowledge of specific apps.

BTW, assuming we have a phone with 300 installed applications, do we
need to configure each one individually, or can we use some shared
configuration? For example, on Android, threads may have the same name
across different applications, such as RenderThread, HeapTaskDaemon,
Binder:xxx etc.

> Default:
>
> # Min Latencies: 00004
> # Avg Latencies: 00062
> # Max Latencies: 04426
>
> With schedqos:
>
> # Min Latencies: 00003
> # Avg Latencies: 00053
> # Max Latencies: 01246
>
> schbench + kernel build
> -----------------------
>
> AUTOGROUP and the CPU controller were disabled, otherwise you won't see
> a difference due to the fairness imposed at group level.
>
> {
>   "make": {
>     "qos": "QOS_BACKGROUND"
>   },
>   "gcc": {
>     "qos": "QOS_BACKGROUND"
>   },
>   "cc1": {
>     "qos": "QOS_BACKGROUND"
>   },
>   "schbench": {
>     "thread_qos": {
>       "schbench-msg": "QOS_USER_INTERACTIVE",
>       "schbench-worker": "QOS_USER_INITIATED"
>     }

I assume you modified schbench? In my experiments, I always get the name
"schbench".

/proc/2892/task$ ls
2892 2893 2894 2895 2896 2897
/proc/2892/task$ cat 2892/comm
schbench
/proc/2892/task$ cat 2893/comm
schbench
/proc/2892/task$ cat 2894/comm
schbench
/proc/2892/task$ cat 2895/comm
schbench
/proc/2892/task$ cat 2896/comm
schbench
/proc/2892/task$ cat 2897/comm
schbench


>   }
> }
>
> Default:
>
> Wakeup Latencies percentiles (usec) runtime 30 (s) (45018 total samples)
> 50.0th: 1618 (11870 samples)
> 90.0th: 3580 (18081 samples)
> * 99.0th: 4952 (3973 samples)
> 99.9th: 6760 (405 samples)
> min=1, max=11092
> Request Latencies percentiles (usec) runtime 30 (s) (45042 total samples)
> 50.0th: 12464 (13518 samples)
> 90.0th: 22496 (18009 samples)
> * 99.0th: 36032 (4020 samples)
> 99.9th: 75904 (405 samples)
> min=3860, max=144284
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 20.0th: 1458 (7 samples)
> * 50.0th: 1486 (9 samples)
> 90.0th: 1566 (12 samples)
> min=1420, max=1603
> average rps: 1501.40
>
> With schedqos:
>
> Wakeup Latencies percentiles (usec) runtime 30 (s) (67556 total samples)
> 50.0th: 10 (15337 samples)
> 90.0th: 6488 (26386 samples)
> * 99.0th: 13168 (5961 samples)
> 99.9th: 19232 (607 samples)
> min=1, max=32126
> Request Latencies percentiles (usec) runtime 30 (s) (67618 total samples)
> 50.0th: 6568 (21537 samples)
> 90.0th: 13744 (25740 samples)
> * 99.0th: 37312 (6064 samples)
> 99.9th: 65472 (602 samples)
> min=3506, max=153046
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 20.0th: 2084 (7 samples)
> * 50.0th: 2268 (9 samples)
> 90.0th: 2412 (12 samples)
> min=1904, max=2523
> average rps: 2253.93
>
> Notes
> -----
>
> Throughput is better, but latencies get worse due to a number of bugs Vincent
> is addressing; patches for those are already on the list [6][7][8].
>
> One observation is that disabling RUN_TO_PARITY yields better latencies. But
> hopefully this won't be necessary.
>
> Also, schbench can suffer from bad task placement where two worker threads
> end up on the same CPU. Once the multi-modal wake up path is ready,
> USER_INTERACTIVE tasks should be spread across CPUs. The current wake up
> behavior spreads based on load only, which means we can end up with these
> accidental bad placements; the scheduler needs to understand that, for better
> latencies, it is best not to place two tasks with short deadlines on the same
> CPU.
>
>
> With schedqos + NO_RUN_TO_PARITY + [6][7][8] patches:
>
> Wakeup Latencies percentiles (usec) runtime 30 (s) (69260 total samples)
> 50.0th: 9 (26985 samples)
> 90.0th: 1342 (19692 samples)
> * 99.0th: 1686 (6280 samples)
> 99.9th: 2428 (570 samples)
> min=1, max=5710
> Request Latencies percentiles (usec) runtime 30 (s) (69338 total samples)
> 50.0th: 8104 (20785 samples)
> 90.0th: 17696 (27798 samples)
> * 99.0th: 23648 (6173 samples)
> 99.9th: 35904 (623 samples)
> min=3835, max=90821
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 20.0th: 2228 (7 samples)
> * 50.0th: 2300 (9 samples)
> 90.0th: 2404 (12 samples)
> min=2171, max=2556
> average rps: 2311.27
>
>
> Contribute
> ==========
>
> Help us make it better! See the list of areas we need help with in CONTRIBUTE.md [9]
>
> Or just give it a go and report any corner cases that don't work so we can
> look at them and make sure we have a plan to get them fixed :)
>
>
> [1] https://developer.apple.com/library/archive/documentation/Performance/Conceptual/EnergyGuide-iOS/PrioritizeWorkWithQoS.html#//apple_ref/doc/uid/TP40015243-CH39-SW1
> [2] https://lore.kernel.org/lkml/20251202181242.1536213-1-vincent.guittot@xxxxxxxxxx/
> [3] https://lore.kernel.org/lkml/20240820163512.1096301-1-qyousef@xxxxxxxxxxx/
> [4] https://lpc.events/event/19/contributions/2244/
> [5] https://developer.apple.com/documentation/os/os_unfair_lock_lock
> [6] https://lore.kernel.org/lkml/20260331162352.551501-1-vincent.guittot@xxxxxxxxxx/
> [7] https://lore.kernel.org/lkml/20260410132321.2897789-1-vincent.guittot@xxxxxxxxxx/
> [8] https://lore.kernel.org/lkml/20260410144808.2943278-1-vincent.guittot@xxxxxxxxxx/
> [9] https://github.com/qais-yousef/schedqos/blob/main/CONTRIBUTE.md


Best Regards
Barry