Tree based scheduling

Rauli Ruohonen (raulir@fishy.pp.sci.fi)
Sat, 17 Jan 1998 13:38:31 +0200 (EET)


I wrote this little document a while ago but never got around to coding the
thing.. I might give it a shot when I have more time, but before I do
anything: comments? Is this a good idea? Note that I don't know the kernel
code very well and probably can't code this yet, so don't expect
anything.. :)

----8<---8<---8<-------cut here----8<----8<-----8<-------------------

/* Forward typedefs: the two node types refer to each other. */
typedef struct sched_node_pid    sched_node_pid;
typedef struct sched_node_parent sched_node_parent;

struct sched_node_pid
{
        sched_node_pid    *next;
        sched_node_parent *parent, *child;
        int pid; /* Add uid field for "user scheduling" support */
};

struct sched_node_parent
{
        sched_node_pid *parent;

        /* the scheduling policy of this node, as function pointers */
        sched_node_pid *(*scheduler)(sched_node_parent *);
        void (*add_node)(sched_node_pid *);
        void (*del_node)(sched_node_pid *);

        void *priv_data;
        int priv_size;
};

Example of a possible tree:

[*]---<normal>---<soft realtime>---[idle task]
         |             |
         |            [*]---[rtproc1]---[rtproc2]
         |
        [*]---<user1>---<user2>---[root's process]---[root's process]
                 |          |
                 |         [*]---[proc1]---[proc2]---[proc3]
                 |
                [*]---<proc1>---[proc2]---[proc3]---[proc4]
                         |
                        [*]---[thread1]---[thread2]---[thread3]

All parent nodes (sched_node_parent) are marked with [*]. As you can
see from the structs, every child or parent of a normal node is a parent
node, and every parent of a parent node is a normal node. E.g. the parent
of [thread1] is the [*] to its left, and [thread2] has the same
parent. The parent of that parent node is <proc1>, which is a (semi-)normal
node. Nodes marked with <> (like <proc1>) are never scheduled
themselves because they have children. A node that can be scheduled
always represents a runnable process and never has any children,
i.e. it's a leaf node.

This tree has nothing to do with the parent/child/sibling
relationships of processes; it only describes how they are
scheduled. Functions that change process state (setuid(),
setpriority(), ...) should update this tree appropriately. Another
thing is that those <user1>, <user2>, ... nodes need not be there; you
can implement the usual scheduling scheme with this as well.. The point
is that you can implement more "fair" and complicated schemes with this
design, and they could actually work faster (but would take a little
more memory - a small price to pay).
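
As an illustration of such a tree update (just a rough sketch, nothing
that exists anywhere): a priority or uid change could simply re-hang the
leaf node under a different parent node through the add_node/del_node
hooks. sched_move_node() is a made-up helper name, and locking is
ignored completely.

void sched_move_node(sched_node_pid *node, sched_node_parent *new_parent)
{
        node->parent->del_node(node);   /* unlink from the old parent's queues */
        node->parent = new_parent;
        new_parent->add_node(node);     /* the new parent's policy decides
                                           which internal queue it goes on */
}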

Each parent node has its own scheduler, and normally this would be a
standard scheduler that has the following structure (pointed to by
void *priv_data):

struct sched_data
{
        /* one queue per scheduling class, plus the current position in each */
        sched_node_pid *ready, *realtime, *idle, *blocking, *exhausted;
        sched_node_pid *cur_ready, *cur_realtime, *cur_idle;
};

The standard scheduler first finds a sched_node_pid pointer from the
realtime, ready or idle queue. If a realtime process is available, it
will not take a process from the ready queue, and idle processes are
selected only when no process can be found in the ready or realtime
queues. If a non-runnable process is encountered, it is moved to
the blocking queue and the search is restarted. (The blocking queue
should be checked every time we are asked to schedule, so that
processes actually wake up.. :)) When a process uses up its time slice
(which only happens to ready-queue processes), it is moved to the
exhausted queue so that scheduling won't slow down. When all processes
have used their timeslice, the scheduler does "ready=exhausted;
exhausted=NULL;". If no processes are available for scheduling, the
scheduler returns NULL.

Now that the scheduler has a non-NULL sched_node_pid pointer, it
checks whether the node has any children. If it has, it calls the
child's scheduler; otherwise it returns the pointer. This scheduler should
perform better than the current one, since it doesn't loop through the
whole process list all the time.. That looping might become an issue when
people get used to the thousands of threads made possible by Ingo's
unlimited-processes patch.
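
Here is a rough sketch of what that standard scheduler could look like
with the structs above (pseudocode, like everything else here).
first_runnable() is a made-up helper that would walk the given queue,
move non-runnable entries to the blocking queue, and return the first
runnable node (or NULL). Whether the chosen node is then popped off the
queue or just tracked with the cur_* pointers is left open, and locking
and time-slice accounting are ignored.

sched_node_pid *std_scheduler(sched_node_parent *p)
{
        struct sched_data *d = p->priv_data;
        sched_node_pid *n;

        /* realtime beats ready, ready beats idle */
        n = first_runnable(&d->realtime, &d->blocking);
        if (!n)
                n = first_runnable(&d->ready, &d->blocking);
        if (!n && d->exhausted) {
                /* every ready process has used its slice: start a new round */
                d->ready = d->exhausted;
                d->exhausted = NULL;
                n = first_runnable(&d->ready, &d->blocking);
        }
        if (!n)
                n = first_runnable(&d->idle, &d->blocking);
        if (!n)
                return NULL;            /* nothing runnable under this node */

        /* an inner node: delegate to the child parent node's scheduler */
        if (n->child)
                return n->child->scheduler(n->child);

        return n;                       /* a leaf, i.e. a runnable process */
}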

The actual scheduler() function then becomes quite simple: it does
root_node->scheduler(root_node) and switches to that process. I haven't
put much thought into SMP, but maybe it could be implemented by having a
node for each processor just under the root node, with each processor
picking processes from its own node. If the idle task would otherwise be
run, the scheduler could migrate a process from another node if possible.
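
A rough sketch of that entry point, combined with the per-CPU idea
(cpu_node[], try_migrate_from_other_cpus() and context_switch_to() are
made-up names, only here to show the shape of the thing):

void schedule(void)
{
        sched_node_parent *node = cpu_node[smp_processor_id()];
        sched_node_pid *next = node->scheduler(node);

        /* nothing local to run: try stealing work before going idle */
        if (!next)
                next = try_migrate_from_other_cpus(node);

        if (next)
                context_switch_to(next);
}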

If each user has his/her own node, they can set the priorities of their
own processes as they wish: if a user sets one of their processes to
realtime priority, it has that priority only over the other processes of
that user, not over the processes of other users. If root wants to give
"real" realtime priority to an arbitrary process, (s)he should move it
higher up in the scheduler tree (with an appropriate utility, of course).
In addition, when a user sets a process priority to "IDLE", it means
"run this when *MY* processes are idle and I have time-slices left". If a
real idle process is wanted, it would be moved to the
"root_node->ready->idle" queue (not to "root_node->idle", because there the
system idle process would compete with it :)).

Since each process with threads has its own scheduler, a threaded
program could choose a scheduling policy itself. For example, if the
threads are set to SCHED_FIFO with equal priorities, they would control
switching among themselves, because the system wouldn't switch to another
thread without an explicit sched_yield() call from the running thread.

Note that most of this document is simplifying things *a lot*, and
some issues can only be settled by actually coding the thing and then
deciding which method to use. I haven't put much thought into locking
issues, data structures, etc. here, so think of those structs as
pseudocode with lots of missing or unnecessary variables :*)

There's yet another scheduler addition I thought would be nice:
a SCHED_LATENCY priority, or whatever it would be called. It would act like
SCHED_RR except that it could be set by normal users and it would have
a time-slice limitation - it would differ from normal processes in
that it would be scheduled almost immediately after it becomes
runnable (a good priority to set for e.g. games, xanim and mpg123 :)).

SCHED_BATCH could also be good; it would also live just under the
<normal> node (just as if it were a user node). The batch scheduler
would use big timeslices so that the batch programs wouldn't cause
excessive swapping by competing with each other, but not so big
that they would stop appearing to be multi-tasked :) The number of real
time slices would stay the same; a batch process would just use
e.g. 100 of them in a row before the scheduler switches to another batch
process. Other processes wouldn't suffer from this, and if the system
has no interactive processes, there would be fewer real context
switches as well. And if the batch processes are important and the users
can suffer (BOFH!), the "virtual batch user" node's priority could be
boosted a little.. :)