Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default

In reply to: Peter Zijlstra: "Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default"

From: Peter Zijlstra

Date: Tue Apr 07 2026 - 05:08:01 EST

On Tue, Apr 07, 2026 at 10:20:18AM +0200, Peter Zijlstra wrote:
> On Sun, Apr 05, 2026 at 11:38:59AM +0530, Ritesh Harjani wrote:
>
> > However, for curiosity, I was hoping if someone more familiar with the
> > scheduler area can explain why PREEMPT_LAZY v/s PREEMPT_NONE, causes
> > performance regression w/o huge pages?
> >
> > Minor page fault handling has microsecond latency, whereas sched ticks
> > are in milliseconds. Besides, both preemption models should anyway
> > schedule() if TIF_NEED_RESCHED is set on return to userspace, right?
> >
> > So was curious to understand how is the preemption model causing
> > performance regression with no hugepages in this case?
>
> So yes, everything can schedule on return-to-user (very much including
> NONE). Which is why rseq slice ext is heavily recommended for anything
> attempting user space spinlocks.
>
> The thing where the other preemption modes differ is the scheduling
> while in kernel mode. So if the workload is spending significant time in
> the kernel, this could cause more scheduling.
>
> As you already mentioned, no huge pages, gives us more overhead on #PF
> (and TLB miss, but that's mostly hidden in access latency rather than
> immediate system time). This gives more system time, and more room to
> schedule.
>
> If we get preempted in the middle of a #PF, rather than finishing it,
> this increases the #PF completion time and if userspace is trying to
> access this page concurrently.... But we should see that in mmap_lock
> contention/idle time :/

Sorry, insufficient wake-up juice applied. Concurrent page faults are
serialized on the page-table (spin) locks, not on mmap_lock.

So the contention would show up as more system time (spinning) rather
than idle time, and that in turn gives more opportunity for kernel
preemption.
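
For anyone who wants to look at this from user space, here is a minimal
sketch (not from the thread, only an illustration) that touches a private
anonymous mapping one page at a time and reports the change in minor
faults, system time and involuntary context switches via getrusage(2).
The function name touch_pages and the 256 MiB size are arbitrary choices.
Note that ru_nivcsw counts all involuntary switches, so it cannot by
itself separate in-kernel preemption from preemption on return to user;
comparing runs across preemption models and THP settings is left to the
reader.

```python
import mmap
import resource

PAGE = mmap.PAGESIZE


def touch_pages(size_bytes):
    """Map private anonymous memory and write one byte per page.

    The first write to each page takes a page fault, so the rusage
    deltas give a rough picture of fault count and time spent in the
    kernel while faulting the region in.
    """
    before = resource.getrusage(resource.RUSAGE_SELF)

    # Private anonymous mapping, not backed by any file.
    buf = mmap.mmap(-1, size_bytes,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for off in range(0, size_bytes, PAGE):
        buf[off] = 1  # first touch of this page

    after = resource.getrusage(resource.RUSAGE_SELF)
    buf.close()

    return {
        "minor_faults": after.ru_minflt - before.ru_minflt,
        "system_time_s": after.ru_stime - before.ru_stime,
        "involuntary_switches": after.ru_nivcsw - before.ru_nivcsw,
    }


if __name__ == "__main__":
    # 256 MiB is an arbitrary size chosen for illustration.
    for name, value in touch_pages(256 * 1024 * 1024).items():
        print(f"{name}: {value}")
```

Running several copies concurrently on a busy machine, with transparent
huge pages toggled, should make the difference in fault count and system
time visible, although the absolute numbers depend on the machine.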