Re: [PATCH] sched: Further restrict the preemption modes
From: Shrikanth Hegde
Date: Fri Jan 09 2026 - 06:26:08 EST
Hi Peter.
On 12/19/25 3:45 PM, Peter Zijlstra wrote:
[ with 6.18 being an LTS release, it might be a good time for this ]
The introduction of PREEMPT_LAZY was for multiple reasons:
- PREEMPT_RT suffered from over-scheduling, hurting performance compared to
!PREEMPT_RT.
- the introduction of (more) features that rely on preemption; like
folio_zero_user() which can do large memset() without preemption checks.
(Xen already had a horrible hack to deal with long running hypercalls)
- the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
cult or in response to poor to replicate workloads.
By moving to a model that is fundamentally preemptable these things become
manageable and avoid needing to introduce more horrible hacks.
Since this is a requirement; limit PREEMPT_NONE to architectures that do not
support preemption at all. Further limit PREEMPT_VOLUNTARY to those
architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
to make this the empty set and completely remove voluntary preemption and
cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)
This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
x86) with only two preemption models: full and lazy (like PREEMPT_RT).
While Lazy has been the recommended setting for a while, not all distributions
have managed to make the switch yet. Force things along. Keep the patch minimal
in case of hard to address regressions that might pop up.
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
kernel/Kconfig.preempt | 3 +++
kernel/sched/core.c | 2 +-
kernel/sched/debug.c | 2 +-
3 files changed, 5 insertions(+), 2 deletions(-)
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -16,11 +16,13 @@ config ARCH_HAS_PREEMPT_LAZY
choice
prompt "Preemption Model"
+ default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
default PREEMPT_NONE
config PREEMPT_NONE
bool "No Forced Preemption (Server)"
depends on !PREEMPT_RT
+ depends on ARCH_NO_PREEMPT
select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
help
This is the traditional Linux preemption model, geared towards
@@ -35,6 +37,7 @@ config PREEMPT_NONE
config PREEMPT_VOLUNTARY
bool "Voluntary Kernel Preemption (Desktop)"
+ depends on !ARCH_HAS_PREEMPT_LAZY
depends on !ARCH_NO_PREEMPT
depends on !PREEMPT_RT
select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7553,7 +7553,7 @@ int preempt_dynamic_mode = preempt_dynam
int sched_dynamic_mode(const char *str)
{
-# ifndef CONFIG_PREEMPT_RT
+# if !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY))
if (!strcmp(str, "none"))
return preempt_dynamic_none;
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struc
static int sched_dynamic_show(struct seq_file *m, void *v)
{
- int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
+ int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
int j;
/* Count entries in NULL terminated preempt_modes */
Maybe only change the default to LAZY, but keep other options possible
via dynamic update?
- When the kernel changes to lazy being the default, the scheduling
pattern can change and that may affect workloads. Having the ability to
dynamically switch to none/voluntary could help one figure out where
it is regressing. We could document the cases where a regression is expected.
- With preempt=full/lazy we will likely never see softlockups. How are
we going to find long-running kernel paths (some may be by design, some may be
bugs) apart from observing workload regressions?
Also, is the softlockup code of any use with preempt=full/lazy?
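
To make the first point concrete, here is a rough userspace sketch of what
the patched sched_dynamic_mode() and sched_dynamic_show() would accept and
list. It is not kernel code: struct model_config and the model_* helpers are
made-up names standing in for the Kconfig symbols, and the gating of "lazy"
on ARCH_HAS_PREEMPT_LAZY is my assumption about the part of the function
that the hunk does not show.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Mode names, in the order assumed for the kernel's preempt_modes[] table. */
const char *const modes[] = { "none", "voluntary", "full", "lazy", NULL };

/* Hypothetical stand-in for the Kconfig symbols; not a kernel type. */
struct model_config {
	bool preempt_rt;
	bool arch_has_preempt_lazy;
};

/*
 * Model of the patched sched_dynamic_mode(): returns the index into
 * modes[] on success, or -1 where the kernel would return -EINVAL.
 */
int model_dynamic_mode(const struct model_config *cfg, const char *str)
{
	/* Mirrors: !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY)) */
	bool restricted = cfg->preempt_rt || cfg->arch_has_preempt_lazy;

	if (!restricted) {
		if (!strcmp(str, "none"))
			return 0;
		if (!strcmp(str, "voluntary"))
			return 1;
	}

	if (!strcmp(str, "full"))
		return 2;

	/* Assumption: "lazy" is only accepted when the arch supports it. */
	if (cfg->arch_has_preempt_lazy && !strcmp(str, "lazy"))
		return 3;

	return -1;
}

/*
 * Model of the patched start index in sched_dynamic_show(): the first
 * two entries ("none", "voluntary") are skipped on RT or lazy-capable
 * architectures.
 */
int model_first_shown(const struct model_config *cfg)
{
	return (cfg->preempt_rt || cfg->arch_has_preempt_lazy) * 2;
}
```

With arch_has_preempt_lazy set, "none" and "voluntary" are rejected and the
listing starts at "full", so on those architectures there would be no
runtime way back to the old models when trying to narrow down a regression.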