Re: [PATCH] sched: Further restrict the preemption modes
From: Peter Zijlstra
Date: Wed Feb 25 2026 - 05:56:03 EST
On Fri, Jan 09, 2026 at 04:53:04PM +0530, Shrikanth Hegde wrote:
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struc
> > static int sched_dynamic_show(struct seq_file *m, void *v)
> > {
> > - int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
> > + int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
> > int j;
> > /* Count entries in NULL terminated preempt_modes */
>
> Maybe only change the default to LAZY, but keep other options possible via
> dynamic update?
>
> - When the kernel changes to lazy as the default, the scheduling pattern
> can change and it may affect workloads. Having the ability to dynamically
> change to none/voluntary could help one figure out where
> it is regressing. We could document cases where regression is expected.
I suppose we could do this. I just worry people will end up with 'echo
voluntary > /debug/sched/preempt' in their startup script, rather than
trying to actually debug their issues.
Anybody with enough knowledge to be useful can edit this line on their
own, rebuild the kernel, and go forth.
Also, I've already heard people are interested in compile-time removing
of cond_resched() infrastructure for ARCH_HAS_PREEMPT_LAZY, so this
would be short lived indeed.
> - with preempt=full/lazy we will likely never see softlockups. How are we
> going to find out longer kernel paths (some may be by design, some may be bugs)
> apart from observing workload regression?
Given the utter cargo-cult placement of cond_resched(), I don't think
we've actually lost much here. You wouldn't have seen the softlockup
thing anyway, because of cond_resched().
Anyway, you can always build on top of function graph tracing, create a
flame graph of stuff and see just where all your runtime went. I'm sure
there are tools that do this already. Perhaps if you're handy with the BPF
stuff you can even create a 'watchdog' of sorts that will scream if any
function takes longer than X us to run or whatever.
Oh, that reminds me, Steve, would it make sense to have
task_struct::se.sum_exec_runtime as a trace-clock?
> Also, is the softlockup code of any use in preempt=full/lazy?
Softlockup has always seemed of dubious value to me -- then again, I've
been running preempt=y kernels from about the day that became an option
:-)
I think it still trips if you lose a wakeup or something.