Re: [PATCH] sched: Further restrict the preemption modes
From: Shrikanth Hegde
Date: Wed Feb 25 2026 - 07:56:48 EST
On 2/25/26 4:23 PM, Peter Zijlstra wrote:
> On Fri, Jan 09, 2026 at 04:53:04PM +0530, Shrikanth Hegde wrote:
>>> --- a/kernel/sched/debug.c
>>> +++ b/kernel/sched/debug.c
>>> @@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struc
>>>  static int sched_dynamic_show(struct seq_file *m, void *v)
>>>  {
>>> -	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
>>> +	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
>>>  	int j;
>>>
>>>  	/* Count entries in NULL terminated preempt_modes */
>> Maybe only change the default to LAZY, but keep the other options
>> possible via dynamic update?
>>
>> - When the kernel changes to lazy as the default, the scheduling
>>   pattern can change and that may affect workloads. Having the ability
>>   to dynamically switch to none/voluntary could help one figure out
>>   where a workload is regressing. We could document the cases where a
>>   regression is expected.
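The runtime switch discussed here is the debugfs knob backed by sched_dynamic_show()/sched_dynamic_write(). A usage sketch (the path requires CONFIG_PREEMPT_DYNAMIC and a mounted debugfs, and the exact mode list shown depends on the kernel configuration, so the output below is illustrative):

```
# The file lists the available modes, with the current one in parentheses.
$ cat /sys/kernel/debug/sched/preempt
none voluntary full (lazy)

# Switch at runtime while chasing a suspected preemption-model regression.
$ echo none > /sys/kernel/debug/sched/preempt
```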
> I suppose we could do this. I just worry people will end up with 'echo
> voluntary > /debug/sched/preempt' in their startup script, rather than
> trying to actually debug their issues.
Ack.
> Anybody with enough knowledge to be useful can edit this line on their
> own, rebuild the kernel and go forth.
>
> Also, I've already heard people are interested in compile-time removal
> of the cond_resched() infrastructure for ARCH_HAS_PREEMPT_LAZY, so this
> would be short-lived indeed.
>> - With preempt=full/lazy we will likely never see softlockups. How are
>>   we going to find longer kernel paths (some may be by design, some may
>>   be bugs) apart from observing workload regressions?
> Given the utter cargo-cult placement of cond_resched(), I don't think
> we've actually lost much here. You wouldn't have seen the softlockup
> thing anyway, because of cond_resched().
>
> Anyway, you can always build on top of function graph tracing, create a
> flame graph of stuff and see just where all your runtime went. I'm sure
> there are tools that do this already. Perhaps if you're handy with the
> BPF stuff you can even create a 'watchdog' of sorts that will scream if
> any function takes longer than X us to run or whatever.
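A sketch of that BPF watchdog idea as a bpftrace script; the probed function (shrink_node) and the 1 ms threshold are placeholders chosen for illustration, not anything from this thread:

```bpftrace
// Record entry time per thread; warn if one invocation runs too long.
kfunc:vmlinux:shrink_node
{
	@start[tid] = nsecs;
}

kretfunc:vmlinux:shrink_node
/@start[tid]/
{
	$delta = nsecs - @start[tid];
	if ($delta > 1000000) {		// threshold: 1 ms, in nanoseconds
		printf("shrink_node ran for %d us on cpu %d\n",
		       $delta / 1000, cpu);
	}
	delete(@start[tid]);
}
```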
> Oh, that reminds me, Steve, would it make sense to have
> task_struct::se.sum_exec_runtime as a trace-clock?
>> Also, is the softlockup code of any use with preempt=full/lazy?
> Softlockup has always seemed of dubious value to me -- then again, I've
> been running preempt=y kernels from about the day that became an option
> :-)
>
> I think it still trips if you lose a wakeup or something.
That's probably a hung task report, right?
IIUC that would be independent of the preemption model.