[PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface

From: Qais Yousef

Date: Sun May 03 2026 - 22:03:28 EST


Provide a generic and extensible interface to describe arbitrary QoS
tags to tell the kernel about specific behavior that doesn't fall
into the existing sched_attr.

The interface is broken into three parts:

* Type
* Value
* Cookie

Type is an enum that should give us enough space to extend (and
deprecate) comfortably.

Value is a signed 64-bit number to allow for arbitrarily high values.

Cookie is there to help group tasks selectively, so that a QoS can
operate on tasks per group. A value of 0 indicates system wide.

There are two anticipated users being discussed on the list.

1. A per-task rampup multiplier to control how fast util rises, and by
implication how quickly a task can migrate between cores on HMP systems
and cause frequencies to rise with schedutil.

2. Tagging a group of tasks that are memory dependent for Cache Aware
Scheduling.

The interface is anticipated to be provisioned to apps via utilities and
libraries. schedqos [1] is an example of how such an interface can be
used to provide a higher level QoS abstraction to describe workloads
without baking it into the binaries, and by implication without worrying
about potential abuse. The interface requires privileged access since
QoS is considered a scarce resource and requires admin control to ensure
it is set properly. Again, that admin control is anticipated to be the
schedqos utility service.

QoS is treated as a scarce resource and the intention is for a syscall
to be done for each individual QoS tag. QoS tags are not inherited on
fork by default, for the same reason.

A reasonable point of debate is whether to make sched_qos an array of
3 or 5 values, to avoid a potential bottleneck if this grows large and
users end up having to issue too many syscalls to set all QoS hints.
Being limited as it is now helps enforce intentionality and scarcity of
tagging.

[1] https://github.com/qais-yousef/schedqos

Signed-off-by: Qais Yousef <qyousef@xxxxxxxxxxx>
---
Documentation/scheduler/index.rst | 1 +
Documentation/scheduler/sched-qos.rst | 44 ++++++++++++++++++
include/uapi/linux/sched.h | 4 ++
include/uapi/linux/sched/types.h | 46 +++++++++++++++++++
kernel/sched/syscalls.c | 10 ++++
.../trace/beauty/include/uapi/linux/sched.h | 4 ++
6 files changed, 109 insertions(+)
create mode 100644 Documentation/scheduler/sched-qos.rst

diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 17ce8d76befc..6652f18e553b 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -23,5 +23,6 @@ Scheduler
sched-stats
sched-ext
sched-debug
+ sched-qos

text_files
diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
new file mode 100644
index 000000000000..0911261cb124
--- /dev/null
+++ b/Documentation/scheduler/sched-qos.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Scheduler QoS
+=============
+
+1. Introduction
+===============
+
+Different workloads have different scheduling requirements to operate
+optimally. The same applies to tasks within the same workload.
+
+To enable smarter usage of system resources and to cater for the conflicting
+demands of various tasks, Scheduler QoS provides a mechanism to pass more
+information about those demands so that the scheduler can make a best effort
+to honour them.
+
+ @sched_qos_type what QoS hint to apply
+ @sched_qos_value value of the QoS hint
+ @sched_qos_cookie magic cookie to tag a group of tasks for which the QoS
+ applies. If 0, the hint will apply globally system
+ wide. If not 0, the hint will apply only to tasks
+ that have the same cookie value.
+
+QoS hints are set once and not inherited by children by design. The
+rationale is that each task has its individual characteristics and it is
+encouraged to describe each of these separately. Also, since system resources
+are finite, there's a limit to what can be done to honour these requests
+before reaching a tipping point where there are too many requests for
+a particular QoS, making it impossible to service all of them at once, and
+some will start to lose out. For example, if 10 tasks require better wake
+up latencies on a 4-CPU SMP system and they all wake up at once, only
+4 can perceive the hint as honoured and the rest will have to wait.
+Inheritance can easily turn these 10 into 100 or 1000, and then the QoS
+hint will rapidly lose its meaning and effectiveness. The chances of 10
+tasks waking up at the same time are lower than those of 100 or 1000.
+
+To set multiple QoS hints, a syscall is required for each. This is a
+trade-off to reduce the churn of extending the interface, as the hope is
+for this to evolve as workloads and hardware get more sophisticated and
+the need for extension arises; when this happens, it should be simpler
+to add the kernel extension and let userspace readily use it by setting
+the newly added flag, without having to update the whole of
+sched_attr.
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 52b69ce89368..3cdba44bc1cb 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -102,6 +102,9 @@ struct clone_args {
__aligned_u64 set_tid_size;
__aligned_u64 cgroup;
};
+
+enum sched_qos_type {
+};
#endif

#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
@@ -133,6 +136,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_QOS 0x80

#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index bf6e9ae031c1..b65da4938f43 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -94,6 +94,48 @@
* scheduled on a CPU with no more capacity than the specified value.
*
* A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * Scheduler QoS
+ * =============
+ *
+ * Different workloads have different scheduling requirements to operate
+ * optimally. The same applies to tasks within the same workload.
+ *
+ * To enable smarter usage of system resources and to cater for the conflicting
+ * demands of various tasks, Scheduler QoS provides a mechanism to pass more
+ * information about those demands so that the scheduler can make a best effort
+ * to honour them.
+ *
+ * @sched_qos_type what QoS hint to apply
+ * @sched_qos_value value of the QoS hint
+ * @sched_qos_cookie magic cookie to tag a group of tasks for which the QoS
+ * applies. If 0, the hint will apply globally system
+ * wide. If not 0, the hint will apply only to tasks
+ * that have the same cookie value.
+ *
+ * QoS hints are set once and not inherited by children by design. The
+ * rationale is that each task has its individual characteristics and it is
+ * encouraged to describe each of these separately. Also, since system resources
+ * are finite, there's a limit to what can be done to honour these requests
+ * before reaching a tipping point where there are too many requests for
+ * a particular QoS, making it impossible to service all of them at once, and
+ * some will start to lose out. For example, if 10 tasks require better wake
+ * up latencies on a 4-CPU SMP system and they all wake up at once, only
+ * 4 can perceive the hint as honoured and the rest will have to wait.
+ * Inheritance can easily turn these 10 into 100 or 1000, and then the QoS
+ * hint will rapidly lose its meaning and effectiveness. The chances of 10
+ * tasks waking up at the same time are lower than those of 100 or 1000.
+ *
+ * To set multiple QoS hints, a syscall is required for each. This is a
+ * trade-off to reduce the churn of extending the interface, as the hope is
+ * for this to evolve as workloads and hardware get more sophisticated and
+ * the need for extension arises; when this happens, it should be simpler
+ * to add the kernel extension and let userspace readily use it by setting
+ * the newly added flag, without having to update the whole of
+ * sched_attr.
+ *
+ * Details about the available QoS hints can be found in:
+ * Documentation/scheduler/sched-qos.rst
*/
struct sched_attr {
__u32 size;
@@ -116,6 +158,10 @@ struct sched_attr {
__u32 sched_util_min;
__u32 sched_util_max;

+ __u32 sched_qos_type;
+ __s64 sched_qos_value;
+ __u32 sched_qos_cookie;
+
};

#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index b215b0ead9a6..88feedd2f7c9 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -481,6 +481,13 @@ static int user_check_sched_setscheduler(struct task_struct *p,
if (p->sched_reset_on_fork && !reset_on_fork)
goto req_priv;

+ /*
+ * Normal users can't set QoS on their own; they must go via an
+ * admin controlled service.
+ */
+ if (attr->sched_flags & SCHED_FLAG_QOS)
+ goto req_priv;
+
return 0;

req_priv:
@@ -552,6 +559,9 @@ int __sched_setscheduler(struct task_struct *p,
return retval;
}

+ if (attr->sched_flags & SCHED_FLAG_QOS)
+ return -EOPNOTSUPP;
+
/*
* SCHED_DEADLINE bandwidth accounting relies on stable cpusets
* information.
diff --git a/tools/perf/trace/beauty/include/uapi/linux/sched.h b/tools/perf/trace/beauty/include/uapi/linux/sched.h
index 359a14cc76a4..4ff525928430 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/sched.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/sched.h
@@ -102,6 +102,9 @@ struct clone_args {
__aligned_u64 set_tid_size;
__aligned_u64 cgroup;
};
+
+enum sched_qos_type {
+};
#endif

#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
@@ -133,6 +136,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_QOS 0x80

#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
--
2.34.1