Re: [RFC][PATCH 1/2] sched: Extended scheduler time slice

From: Steven Rostedt
Date: Thu Feb 06 2025 - 08:32:28 EST


On Wed, 5 Feb 2025 22:07:12 -0500
Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > RT tasks don't have a time slice. They are affected by events. An external
> > interrupt coming in, or a timer going off that states something is
> > happening. Perhaps we could use this for SCHED_RR or maybe even
> > SCHED_DEADLINE, as those do have time slices.
> >
> > But if it does get used, it should only be used when the task being
> > scheduled is the same SCHED_RR priority, or if SCHED_DEADLINE will not fail
> > its guarantees.
> >
>
> Right, it would apply still to RR/DL though...

But it would have to guarantee that the RR it is delaying is of the same
priority, and that delaying the DL is not going to cause something to miss
its deadline.

>
> > > In any case, if you want this to only work on FAIR tasks and not RT
> > > tasks, why is that only possible to do with rseq() + LAZY preemption
> > > and not Prakash's new API + all preemption modes?
> > >
> > > Also you can just ignore RT tasks (not that I'm saying that's a good
> > > idea but..) in taskshrd_delay_resched() in that patch if you ever
> > > wanted to do that.
> > >
> > > I just feel the RT latency thing is a non-issue AFAICS.
> >
> > Have you worked on any RT projects before?
>
> Heh.. I think maybe you misunderstood my statement, I was mentioning
> that I felt (similar to Peter I think) that NOT adopting this feature
> generically for all tasks due to a concern of 50us latency maybe does
> not make sense since poorly designed app / random hardware already
> have this issue. I think the main concern discussed in this thread is
> (and please CMIIW):

We have code that runs with sub-100us latencies, and tighter. If some random
user space application applies this, adding 50us (or even 20us) will break
it. And this has nothing to do with poorly designed applications or hardware.

By adding this as a feature that works everywhere, you will break use cases
that work today.


> 1. Locking down this feature to only SCHED_OTHER versus making it
> generic (maybe sched_ext could also use it?).

sched_ext can do whatever it wants ;-)

But the reason I picked SCHED_OTHER is because that's the only policy that
has no control of when it gets preempted by lower priority processes.

This isn't about "hey I'm in a critical section can you delay higher
priority applications with strict deadlines for me?"

The scheduler tick comes at random moments. SCHED_OTHER is more about
performance and not about latency. Sure, we want better latency when it
comes to reaction times, but that's usually in the millisec range. Not
microsecond range. RT and DL tasks do care about microseconds. And every
microsecond counts. This is why I was fine in limiting this to 50us.

> 2. Tying it to specific preemption methods which may change user mode
> behavior/expectation (because LAZY is tied to preemption method).

Well, every time a user task calls a system call, it is affected by the
preemption method. And I also reported how this can work in all preemption
methods, but only for SCHED_OTHER. It will just take some work on how the
kernel handles NEED_RESCHED_LAZY. User space will be unaware of any of this.

> 3. Overloading the purpose of LAZY: My understanding is, the purpose
> of LAZY is to let the scheduler decide if it wants to preempt based on
> preemption mode. It is not based on any hint, just on the preemption
> mode. I guess you are overloading LAZY by making LAZY flag also extend
> userspace timeslice (versus say making the time-slice extension hint
> its own thing...).

I already replied about that. Note, LAZY was created in PREEMPT_RT for this
very purpose (but in the kernel), and ported to vanilla for a slightly
different purpose.

Here's the history:

PREEMPT_RT would convert spin_locks in the kernel to sleeping mutexes.

This made RT tasks respond much faster to events.

But non-RT (SCHED_OTHER) started suffering performance issues.

When looking at the performance issues, we found that it was due to tasks
holding these sleeping spin_locks and being preempted. That is, the
preemption of holding spin_locks was causing more contention and slowing
things down tremendously.

To first handle this, adaptive mutexes were introduced. These would spin
if the owner of the lock was still running, and would go to sleep if the
owner went to sleep. This helped things quite a bit, but PREEMPT_RT still
suffered a performance deficit compared to non-RT.

This was because of the timer tick on SCHED_OTHER tasks that could
preempt a task holding a spin lock.

NEED_RESCHED_LAZY was introduced to remedy this. It would be set for
SCHED_OTHER tasks and NEED_RESCHED for RT tasks. If the task was holding
a sleeping spin lock, the NEED_RESCHED_LAZY would not preempt the running
task, but NEED_RESCHED would. If the SCHED_OTHER task was not holding a
sleeping spin_lock it would be preempted regardless.
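The tick-time decision described above can be sketched roughly like this
(purely illustrative user-space C; the flag names and the helper are made
up for the example, the real logic lives in the PREEMPT_RT scheduler paths):

```c
#include <stdbool.h>

/* Illustrative flags; the kernel uses TIF_NEED_RESCHED and
 * TIF_NEED_RESCHED_LAZY in the thread_info flags. */
#define NEED_RESCHED      0x1  /* set when an RT task needs the CPU */
#define NEED_RESCHED_LAZY 0x2  /* set for SCHED_OTHER reschedules   */

/*
 * Decide whether the tick may preempt the current task.
 * NEED_RESCHED always preempts; NEED_RESCHED_LAZY defers only while
 * the task holds a sleeping spin lock.
 */
static bool should_preempt(unsigned int flags, bool holds_sleeping_lock)
{
    if (flags & NEED_RESCHED)
        return true;                 /* RT task waiting: preempt now */
    if (flags & NEED_RESCHED_LAZY)
        return !holds_sleeping_lock; /* defer inside the critical section */
    return false;
}
```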

This improved the performance of SCHED_OTHER tasks in PREEMPT_RT to be as
good as what was in vanilla.

You see, LAZY was *created* for this purpose. Of letting the scheduler know
that the running task is in a critical section and the timer tick should
not preempt a SCHED_OTHER task.

I just wanted to extend this to SCHED_OTHER in user space too.
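The user-space side would look something like the sketch below. Note the
structure name, the flag layout, and the helper names here are all
hypothetical -- the actual RFC uses a field in the task's rseq area, and
the ABI is still under discussion:

```c
#include <stdbool.h>

/* Hypothetical per-thread area shared with the kernel. */
struct extend_area {
    volatile unsigned int flags;
};

#define EXTEND_REQUEST 0x1  /* task: "I'm in a critical section"        */
#define EXTEND_YIELD   0x2  /* kernel: "you were granted extra time, yield" */

/* Mark entry into a user-space critical section. */
static void critical_enter(struct extend_area *a)
{
    a->flags |= EXTEND_REQUEST;
}

/*
 * Leave the critical section.  Returns true if the kernel delayed a
 * preemption for us, in which case the task should yield immediately
 * (e.g. via sched_yield()) rather than keep running.
 */
static bool critical_exit(struct extend_area *a)
{
    unsigned int old = a->flags;

    a->flags = 0;
    return (old & EXTEND_YIELD) != 0;
}
```

The key property is that the fast path (no contention, no tick during the
critical section) is just two stores in user space, with no system call.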

>
> Yes, I have worked on RT projects before -- you would know better
> than anyone. :-D. But admittedly, I haven't got to work much with
> PREEMPT_RT systems.

Just using RT policy to improve performance is not an RT project. I'm
talking about projects where, if you miss a deadline, things crash. Where
the project works very hard to make sure everything works as intended.

I'm totally against allowing SCHED_OTHER to use any feature that can delay
an RT/DL task (unless of course it is to help those, like priority inheritance).

There's several RT folks on this thread. I wonder if any of
them are OK with this?

-- Steve