Björn Steinbrink wrote:
The killer problem I see with this approach is that it doesn't address
the divide and conquer problem. If a task is capped, and forks off
workers, each worker inherits the total cap, effectively extending same.
Yes, although the current thinking is that you need to set a special
clone() flag (which may be restricted via capabilities such as
CAP_SYS_RESOURCE) to set a new CPU scheduling namespace, so the workers
will inherit the same scheduling ns and therefore be accounted against
the one resource.
Sorry if I'm replying out of context, I'll catch up on this thread
shortly. Thanks for including me.
IMHO, per task resource management is too severely limited in its
usefulness, because jobs are what need managing, and they're seldom
single threaded. In order to use per task limits to manage any given
job, you have to both know the number of components, and manually
distribute resources to each component of the job. If a job has a
dynamic number of components, it becomes impossible to manage.
Linux-VServer uses a token bucket scheduler (TBS) to limit cpu resources
for processes in a "context". All processes in a context share one token
bucket, which has a set of parameters to tune scheduling behaviour.
As the token bucket is shared by a group of processes, and inherited by
child processes/threads, management is quite easy. And the parameters
can be tuned to allow different scheduling behaviours, like allowing a
process group to burst, ie. use as much cpu time as is available, after
being idle for some time, but being limited to X % cpu time on average.
This is correct. Basically I read the LARTC.org documentation (which
explains Linux network schedulers etc.), and the description of the
Token Bucket Scheduler there inspired me to write the same thing for CPU
resources. It was originally developed for the 2.4 Alan Cox series
kernels. The primary design guarantee of the scheduler is a low total
performance impact and maximum CPU utilisation; prioritisation and
fairness are a secondary concern. It was built with the idea that people
wanting different sorts of scheduling policies could at least get a set
of userland controls to implement their approach - to the limit of the
effectiveness of task-level controls.
I most recently described this at http://lkml.org/lkml/2006/3/29/59; a
lot of that thread is probably worth catching up on.
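For anyone who hasn't seen the idea before, here is a minimal user-space
simulation of the per-context token bucket described above. The field
names (fill_rate, interval, tokens_max) follow the usual token bucket
vocabulary and are illustrative, not necessarily the exact in-kernel
tunables:

#include <stdbool.h>
#include <stdio.h>

struct context_bucket {
	int tokens;	/* current fill level, shared by the whole context */
	int tokens_max;	/* bucket size: how long a burst may last          */
	int fill_rate;	/* tokens added every 'interval' ticks              */
	int interval;
};

/* Called once per scheduler tick for the whole context. */
static void bucket_refill(struct context_bucket *b, unsigned long tick)
{
	if (tick % b->interval == 0) {
		b->tokens += b->fill_rate;
		if (b->tokens > b->tokens_max)
			b->tokens = b->tokens_max;	/* cap the burst */
	}
}

/* Called when a task from the context runs for one tick: consume a
 * token; if the bucket is empty the context is over its average share
 * and its tasks should be deprioritised (or, with a hard limit, not
 * run at all). */
static bool bucket_charge(struct context_bucket *b)
{
	if (b->tokens > 0) {
		b->tokens--;
		return true;	/* within the allowed share */
	}
	return false;		/* bucket empty: over the cap */
}

int main(void)
{
	/* fill_rate / interval = 1/4 => ~25% CPU on average; tokens_max
	 * lets an idle context burst at full speed for up to 100 ticks. */
	struct context_bucket b = { .tokens = 100, .tokens_max = 100,
				    .fill_rate = 1, .interval = 4 };
	int ran = 0;

	for (unsigned long tick = 0; tick < 1000; tick++) {
		bucket_refill(&b, tick);
		if (bucket_charge(&b))
			ran++;
	}
	/* Prints roughly 350: the 25% average share plus the initial burst. */
	printf("ran %d of 1000 ticks\n", ran);
	return 0;
}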
It would be nice if we could somehow re-use the scheduling algorithms in
use in the network space here, if it doesn't impact on performance.
For instance, the "CBQ" network scheduler is the same approach as used
in OpenVZ's CPU scheduler, and the classful Token Bucket Filter is the
approach used in VServer. The "Sched_prio" and "Sched_hard" distinction
in vserver could probably be compared to "Ingress Policing", where
available CPU that could run a process instead sits idle - similar to
the network world where data that has arrived is dropped to try to
convince the application to throttle its network activity.
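To illustrate that distinction, here is a toy sketch of the decision the
scheduler has to make when a context's bucket runs dry. The names and
the penalty value are made up purely for illustration:

#include <stdbool.h>
#include <stdio.h>

enum limit_mode { LIMIT_SOFT, LIMIT_HARD };	/* ~ sched_prio vs sched_hard */

struct context_sched {
	enum limit_mode mode;
	bool over_limit;	/* token bucket empty?     */
	int base_prio;		/* lower value = preferred */
};

#define PRIO_PENALTY 10		/* illustrative penalty only */

/* Hard limiting ("ingress policing" style): an over-limit context is not
 * runnable at all, so the CPU sits idle even if nothing else wants it. */
static bool context_runnable(const struct context_sched *cs)
{
	return !(cs->mode == LIMIT_HARD && cs->over_limit);
}

/* Soft limiting: an over-limit context only gets a priority penalty,
 * so it can still soak up otherwise-idle CPU. */
static int context_effective_prio(const struct context_sched *cs)
{
	if (cs->mode == LIMIT_SOFT && cs->over_limit)
		return cs->base_prio + PRIO_PENALTY;
	return cs->base_prio;
}

int main(void)
{
	struct context_sched soft = { LIMIT_SOFT, true, 0 };
	struct context_sched hard = { LIMIT_HARD, true, 0 };

	printf("soft: runnable=%d prio=%d\n",
	       context_runnable(&soft), context_effective_prio(&soft));
	printf("hard: runnable=%d prio=%d\n",
	       context_runnable(&hard), context_effective_prio(&hard));
	return 0;
}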
As in the network space (http://lkml.org/lkml/2006/5/19/216), in this
space we have a continuous scale of possible implementations, marked by
a highly efficient solution akin to "binding" at one end, and
virtualisation at the other. Each delivers guarantees most applicable to
certain users or workloads.
I'm CC'ing Herbert and Sam on this as they can explain the whole thing a
lot better and I'm not familiar with implementation details.