Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

From: Steven Rostedt
Date: Tue Oct 24 2023 - 10:34:35 EST


On Tue, 19 Sep 2023 01:42:03 +0200
Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

> 2) When the scheduler wants to set NEED_RESCHED, it sets
> NEED_RESCHED_LAZY instead, which is only evaluated at the return to
> user space preemption points.
>
> As NEED_RESCHED_LAZY is not folded into the preemption count, the
> preemption count won't become zero, so the task can continue until
> it hits return to user space.
>
> That preserves the existing behaviour.

I'm looking into extending this concept to user space and to VMs.

I'm calling this the "extended scheduler time slice" (ESTS, pronounced "estis").

The idea is this. Have VMs / user space share a memory region with the
kernel that is per thread / vCPU. This would be registered via a syscall or
an ioctl on some defined file or whatever. Then, when entering user space /
the VM, if NEED_RESCHED_LAZY (or whatever it's eventually called) is set,
the kernel checks whether the thread has this memory region and whether a
special bit in it is set. If so, it does not schedule, and instead treats
it like a long kernel system call.
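
Roughly, I'm picturing something like the below. The structure layout, the
flag names and the "ests_area" pointer on the task are all made up here
just for illustration, not a proposed ABI:

struct ests_area {
	unsigned int	flags;
};

/* Set by user space: "I'm in a critical section, give me a bit more time" */
#define ESTS_USER_EXTEND	(1U << 0)
/* Set by the kernel: "I wanted to schedule, yield when you are done" */
#define ESTS_KERNEL_RESCHED	(1U << 1)

Then, on return to user space / VM enter with only NEED_RESCHED_LAZY set,
the kernel would do something along the lines of:

	if (tsk->ests_area && (tsk->ests_area->flags & ESTS_USER_EXTEND)) {
		/* Let the task run on, like a long system call */
		tsk->ests_area->flags |= ESTS_KERNEL_RESCHED;
	} else {
		schedule();
	}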

The kernel will then set another bit in the shared memory region to tell
user space / the VM that it wanted to schedule, but is allowing it to
finish its critical section. When user space / the VM is done with the
critical section, it checks the bit that may have been set by the kernel,
and if it is set, it should do a sched_yield() (or a VMEXIT) so that the
kernel can now schedule it.
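
On the user space side, the usage around a critical section would then look
something like the sketch below (again, the names and the registered "ests"
pointer are placeholders, and a real version would want proper atomics /
barriers on the flag updates):

#include <pthread.h>
#include <sched.h>

/* Hypothetical pointer to the per-thread region registered with the kernel */
extern volatile struct ests_area *ests;

static void do_locked_work(pthread_spinlock_t *lock, void (*work)(void))
{
	/* Ask the kernel to hold off lazy preemption while the lock is held */
	ests->flags |= ESTS_USER_EXTEND;

	pthread_spin_lock(lock);
	work();			/* the critical section */
	pthread_spin_unlock(lock);

	ests->flags &= ~ESTS_USER_EXTEND;

	/* The kernel wanted to schedule while we were extended, yield now */
	if (ests->flags & ESTS_KERNEL_RESCHED) {
		ests->flags &= ~ESTS_KERNEL_RESCHED;
		sched_yield();
	}
}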

What about DoS, you say? It's no different from running a long system call.
No task can run forever. It's not a "preempt disable", it's just "give me
some more time". A NEED_RESCHED will always schedule, just like a kernel
system call that takes a long time. The goal is to let user space get out
of critical sections that we know can cause problems if they get preempted.
Usually that's a user space / VM lock being held, or maybe a VM interrupt
handler that needs to wake up a task on another vCPU.

If we are worried about abuse, we could even punish tasks that don't call
sched_yield() by the time their extended time slice is up. Even without
that punishment, if we have EEVDF, the extension will make the task less
eligible the next time around.

The goal is to prevent a thread / vCPU from being preempted while holding
a lock or resource that other threads / vCPUs will want. That is, prevent
contention, as that's usually the biggest performance issue in user space
and VMs.

I'm going to work on a PoC and see if I can get some benchmarks on how
much this could help workloads like databases and VMs in general.

-- Steve