RFC: task groups, loadable schedulers

Paul Barton-Davis (pbd@Op.Net)
Thu, 8 Oct 1998 03:06:45 -0400


I'd welcome feedback on a (significant) modification of the Linux
scheduler. The primary change is a new kernel abstraction, a "task
group", which is described below. This abstraction allows for the
relatively simple implementation of loadable schedulers. This system
is *implemented* already.

Although no code is included here, my UP system is running a kernel that
uses task groups and loadable schedulers. It's been through several
kernel builds and runs X, PPP, etc.

From kernel/task_group.c:

TASK GROUPS

A task group is a kernel abstraction consisting of one or more
tasks considered to form a "group". The abstraction allows various
kernel-level activities to be carried out on the group as a whole,
perhaps using group-specific functions when appropriate.

The task group support code is controlled by a single config
variable, CONFIG_TASK_GROUPS. If unset, the kernel code is left
completely unchanged.

Currently, scheduling is the only operation supported by
CONFIG_TASK_GROUPS. Specifically, it allows a task group to
specify its own scheduler, which will be invoked whenever the task
group is allocated CPU time. If the group has not specified a
scheduler, the default scheduler will be used to decide which of
the group's tasks (in fact, given that it's the default scheduler,
which of the tasks of all task groups) to run next.
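
As a rough illustration of the fallback described above, here is a
minimal user-space sketch, not the code from kernel/task_group.c. The
struct layouts, the per-group task list, pick_next_task() and
last_runnable_scheduler() are all assumptions made up for the example;
only the idea of "use the group's scheduler if it has one, otherwise
the default scheduler" comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct task {
    int pid;
    int runnable;              /* nonzero if the task can run */
    struct task *next;         /* next task in the same group, or NULL */
};

struct task_group;
typedef struct task *(*group_sched_fn)(struct task_group *);

struct task_group {
    int tgid;
    struct task *tasks;        /* singly linked list of member tasks */
    group_sched_fn scheduler;  /* NULL means "use the default scheduler" */
};

/* Default policy in this sketch: the first runnable task in the group. */
struct task *default_scheduler(struct task_group *g)
{
    struct task *t;

    assert(g != NULL);
    for (t = g->tasks; t != NULL; t = t->next)
        if (t->runnable)
            return t;
    return NULL;
}

/* Example group-specific policy: the last runnable task in the group. */
struct task *last_runnable_scheduler(struct task_group *g)
{
    struct task *t, *pick = NULL;

    assert(g != NULL);
    for (t = g->tasks; t != NULL; t = t->next)
        if (t->runnable)
            pick = t;
    return pick;
}

/* Called once group g has been allocated CPU time. */
struct task *pick_next_task(struct task_group *g)
{
    group_sched_fn fn = g->scheduler ? g->scheduler : default_scheduler;

    return fn(g);
}
```

Keeping a NULL pointer for "no scheduler specified" means a freshly
created group needs no setup at all to be schedulable.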

The abstraction can theoretically be used for other group-related
things as well. An obvious set of examples includes limiting
resource consumption, such as memory usage, network bandwidth,
disk bandwidth and so forth.

In the current design, the system begins with a single task
group, the init_tgroup. The init_task (what happens on SMP
machines, where there's more than one init_task?) belongs to this
group. Since the default operation of do_fork() does not create a
new task group, all tasks created by init_task and its children
also belong to the init_tgroup.

At any time, a task may execute clone() using the CLONE_TGROUP
flag. This (paradoxically) creates a new task group, and assigns
the newly created task to it. The new group will be scheduled by
the default scheduler until it specifies its own scheduler. Note
that the newly created task may or may not share any of its
resources with its parent: the task group abstraction has no
policy regarding shared resources between tasks in (or out of) a
group.
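
The group handling at fork time might look roughly like the following
user-space model. It is a sketch under stated assumptions: the value of
CLONE_TGROUP is a placeholder (the post does not give one), and
fork_task(), new_task_group() and the struct fields are invented for
illustration. Only the rule itself is from the text: without
CLONE_TGROUP the child inherits the parent's group, with it the child
becomes the first member of a new group.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CLONE_TGROUP 0x1000UL   /* placeholder value, not from the post */

struct task_group {
    int tgid;
    int ntasks;                 /* number of member tasks */
};

struct task {
    int pid;
    struct task_group *tgroup;
};

static int next_pid = 1;
static int next_tgid = 1;       /* tgid 0 is left for the init_tgroup */

/* Allocate an empty group; returns NULL if out of memory. */
struct task_group *new_task_group(void)
{
    struct task_group *g = malloc(sizeof(*g));

    if (g == NULL)
        return NULL;
    g->tgid = next_tgid++;
    g->ntasks = 0;
    return g;
}

/* Model of the group handling in do_fork(); returns NULL on failure. */
struct task *fork_task(struct task *parent, unsigned long clone_flags)
{
    struct task *child;
    struct task_group *g;

    assert(parent != NULL && parent->tgroup != NULL);
    child = malloc(sizeof(*child));
    if (child == NULL)
        return NULL;

    if (clone_flags & CLONE_TGROUP) {
        g = new_task_group();       /* child starts a new group */
        if (g == NULL) {
            free(child);
            return NULL;
        }
    } else {
        g = parent->tgroup;         /* default: inherit the parent's group */
    }

    child->pid = next_pid++;
    child->tgroup = g;
    g->ntasks++;
    return child;
}
```

Note that nothing here looks at the other clone flags, which matches
the point that group membership is independent of resource sharing.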

A task may call sys_set_tgroup_scheduler (int sched_id) to set its
group's scheduler. It may also call sys_get_tgid() to determine
the group it is in. No task in the init_tgroup can set the
scheduler for that group - it always uses the default scheduler,
whose operation is almost identical to the bottom end of the
current Linux schedule() code.

New schedulers may be loaded using conventional loadable modules;
the module just calls add_scheduler (struct scheduler_struct *) to
make its scheduler available for use.
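
To make the registration and selection steps concrete, here is a small
user-space sketch. The members of struct scheduler_struct, the fixed
size table, the errno return values and fifo_sched are all assumptions
for the sake of the example; the post only names add_scheduler(),
sys_set_tgroup_scheduler() and the rule that the init_tgroup cannot
change its scheduler.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_SCHEDULERS 8        /* arbitrary limit for this sketch */

struct task_group;

/* Member layout is assumed; the post names the struct but not its fields. */
struct scheduler_struct {
    int sched_id;
    const char *name;
    void (*schedule)(struct task_group *);
};

struct task_group {
    int tgid;
    struct scheduler_struct *sched;   /* NULL means the default scheduler */
};

struct task_group init_tgroup = { 0, NULL };
static struct scheduler_struct *sched_table[MAX_SCHEDULERS];

/* What a module's init routine would call to publish its scheduler. */
int add_scheduler(struct scheduler_struct *s)
{
    int i;

    if (s == NULL || s->schedule == NULL)
        return -EINVAL;
    for (i = 0; i < MAX_SCHEDULERS; i++)
        if (sched_table[i] != NULL && sched_table[i]->sched_id == s->sched_id)
            return -EEXIST;
    for (i = 0; i < MAX_SCHEDULERS; i++) {
        if (sched_table[i] == NULL) {
            sched_table[i] = s;
            return 0;
        }
    }
    return -ENOSPC;
}

/* Model of sys_set_tgroup_scheduler() acting on the caller's group g. */
int set_tgroup_scheduler(struct task_group *g, int sched_id)
{
    int i;

    assert(g != NULL);
    if (g == &init_tgroup)
        return -EPERM;          /* init_tgroup keeps the default scheduler */
    for (i = 0; i < MAX_SCHEDULERS; i++) {
        if (sched_table[i] != NULL && sched_table[i]->sched_id == sched_id) {
            g->sched = sched_table[i];
            return 0;
        }
    }
    return -ENOENT;
}

/* An example "loadable" scheduler whose policy is left empty. */
void fifo_schedule(struct task_group *g) { (void)g; }
struct scheduler_struct fifo_sched = { 1, "fifo", fifo_schedule };
```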

Assigning CPU time to a task group is carried out by a high level
global scheduler. However, this too is mutable, since it is called
by indirection through a function pointer. The system call
set_global_scheduler (void (*func)()) can be used to reset this
function pointer, thereby altering the large-scale characteristics
of Linux' scheduling.

The current (toy) global scheduler just does round-robin
scheduling of each task group, allowing each one 1 jiffy of CPU
time before allocating the CPU to another group. This is known to
be simplistic, and is simply a demonstration. Obviously, in the
default case, the init_tgroup, which contains all tasks, gets 100%
of the CPU.
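
A toy global scheduler of this kind, reached through a replaceable
function pointer, could be modelled in user space as below. This is a
sketch, not the implemented code: current_tgroup, the jiffies stand-in,
run_global_scheduler() and the exact "one tick elapsed" test are
assumptions. The scheduled_at and sched_count field names are taken
from the overhead list in this description.

```c
#include <assert.h>
#include <stddef.h>

struct task_group {
    int tgid;
    unsigned long scheduled_at;   /* jiffies value when last selected */
    unsigned long sched_count;    /* number of times selected */
    struct task_group *next;      /* circular list of all groups */
};

unsigned long jiffies;            /* stand-in for the kernel tick counter */
struct task_group *current_tgroup;

/* Toy policy: move to the next group once the current one has had a tick. */
void rr_global_scheduler(void)
{
    assert(current_tgroup != NULL);
    if (jiffies - current_tgroup->scheduled_at >= 1) {
        current_tgroup = current_tgroup->next;
        current_tgroup->scheduled_at = jiffies;
        current_tgroup->sched_count++;
    }
}

/* The indirection: schedule() would call through this pointer. */
static void (*global_scheduler)(void) = rr_global_scheduler;

/* Model of set_global_scheduler(); a NULL argument is ignored here. */
void set_global_scheduler(void (*func)(void))
{
    if (func != NULL)
        global_scheduler = func;
}

void run_global_scheduler(void)
{
    global_scheduler();
}
```

With a single group whose ->next points at itself, every call selects
that same group again, which is the "init_tgroup gets 100% of the CPU"
case.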

No modification to other kernel interfaces is necessary to support
task groups. A task that wishes to cause scheduling to occur still
just calls schedule().

In the simplest case, where no new task groups exist and all
tasks belong to the init_tgroup, the only schedule()-time overhead
of this scheme consists of:

1) selecting the next task group from a circular list
(just following the ->next link)

2) marking its scheduled_at and incrementing its sched_count fields

3) a function call into default_scheduler() where none
existed before.

4) within the default_scheduler(), checking that any
potential task belongs to the group scheduled by the
global scheduler.
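
The four steps above can be sketched as a small user-space model. The
run queue representation, the use of a counter field as the selection
criterion, and the names schedule_model() and next_run are assumptions
for illustration; the real default scheduler follows the existing
schedule() logic, which is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

struct task_group {
    unsigned long scheduled_at;
    unsigned long sched_count;
    struct task_group *next;      /* circular list of groups */
};

struct task {
    int pid;
    long counter;                 /* stand-in for the task's priority */
    struct task_group *tgroup;
    struct task *next_run;        /* run queue link, NULL-terminated here */
};

unsigned long jiffies;            /* stand-in for the kernel tick counter */
struct task_group *current_tgroup;
struct task *runqueue;

/* Pick the best runnable task that belongs to group g. */
struct task *default_scheduler(struct task_group *g)
{
    struct task *t, *best = NULL;

    for (t = runqueue; t != NULL; t = t->next_run) {
        if (t->tgroup != g)       /* step 4: skip tasks of other groups */
            continue;
        if (best == NULL || t->counter > best->counter)
            best = t;
    }
    return best;
}

/* The extra work done at schedule() time in the default case. */
struct task *schedule_model(void)
{
    struct task_group *g;

    assert(current_tgroup != NULL);
    g = current_tgroup->next;     /* step 1: follow the ->next link */
    g->scheduled_at = jiffies;    /* step 2: bookkeeping */
    g->sched_count++;
    current_tgroup = g;
    return default_scheduler(g);  /* step 3: the added function call */
}
```

When only the init_tgroup exists, step 1 returns the same group each
time and the check in step 4 never rejects a task, so the added cost is
a pointer dereference, two stores, one call and one compare per
candidate task.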

I would welcome comments on this scheme. It's already quite a lot of
fun to play with, and I believe it has many (positive) ramifications for
Linux overall. It does cost a few usecs in context switch time, but it
adds enormous flexibility, quite likely enough to gain back what is
lost, through the use of specialized group schedulers. Richard Gooch's
recent struggles on this list could instead be solved individually,
using the best possible solution for the situation.

Another nice example would be a task group scheduler that exported a
page of write-only memory to user-space, and allowed a user level
thread system to scribble there to provide hints on the user level
state of things. We end up with the best of scheduler activations
(i.e. threads without the problems of either kernel threads or user
threads) but with extremely low overhead.

--p
