Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice
From: Sebastian Andrzej Siewior
Date: Thu Feb 06 2025 - 10:02:12 EST
On 2025-02-06 15:27:17 [+0100], Peter Zijlstra wrote:
> On Thu, Feb 06, 2025 at 03:22:34PM +0100, Sebastian Andrzej Siewior wrote:
> > Then this feature adds 20us on top?
>
> The point has always been for the number to be < the observable
> scheduling latency.
>
> I'm not sure what that number is, and it is always hardware dependent. I
> measured it on a random test box when I did the prototype a long while
> ago, and ended up at 50us, but for all I know that machine was running a
> lockdep enabled kernel at the time (won't be the first and certainly
> won't be the last time I try and do a performance measurement on a debug
> kernel).
When I have lockdep enabled, I have scheduling latencies >1ms.
> That was not the important part -- but everybody fixates on the number,
> instead of the intent.
I don't mind delaying a SCHED_OTHER wakeup for the greater good. If the
number is 50us, so be it. That is certainly not something I complain about.
I'm just asking not to delay the wakeup of an RT task which should be
on the CPU based on its priority.
Depending on the RT application, it is not just the interrupt and
preempt-off sections that you worry about. It could also involve PI-boosting
a SCHED_OTHER task on a different CPU so that it releases the lock in
question and the RT application on _this_ CPU can make progress. So you have
50us on this CPU and 50us on the remote CPU because it also does the
LAZY thingy for performance reasons. And so the number doubled.
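A back-of-the-envelope sketch of that addition. The function name is made up for illustration, and it assumes the worst case in which every CPU involved in the PI chain uses its full extension, one after the other:

```python
def worst_case_extra_delay_us(extension_us, cpus_in_chain):
    """Upper bound on the added delay, assuming (hypothetically) that each
    CPU in the chain consumes its full time-slice extension in sequence."""
    return extension_us * cpus_in_chain


# Only this CPU delays the RT task.
print(worst_case_extra_delay_us(50, 1))  # 50
# This CPU plus the remote CPU running the PI-boosted lock owner.
print(worst_case_extra_delay_us(50, 2))  # 100
```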
> I'm assuming you have a recent number around -- what's sane? 5us, less?
As I tried to explain, any additional delay hurts. If your application
requires a latency of 1ms and you get max 100us based on testing, then an
additional 50us certainly won't hurt you. However, if you require 200us
max, you already struggle with 160us, especially if everything fires at
once and the caches are gone. In this case the 5us will still fit the
requirement on paper but the buffer got smaller. Also, the 5us requires
a timer to fire etc…
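The same arithmetic as a small sketch. The helper name is made up, and it assumes the 100us and 160us figures are measured worst cases that the extension simply adds to:

```python
def headroom_us(required_us, measured_max_us, extra_delay_us):
    """Remaining buffer between the required latency and the measured
    worst case once the extra delay is added on top (all values in us)."""
    return required_us - (measured_max_us + extra_delay_us)


print(headroom_us(1000, 100, 50))  # 850, plenty of room
print(headroom_us(200, 160, 50))   # -10, requirement missed
print(headroom_us(200, 160, 5))    # 35, fits on paper, down from 40
```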
There are "bigger" x86 boxes with high-clocked CPUs and big caches which
can be partitioned and so on, and everything is nice.
There are also smaller x86 boxes where two back-to-back trace_printk()
calls in an IRQ-off region are 2us apart and you yell at the scheduler for
taking 30us for a scheduling decision.
Sebastian