[RFC][PATCH 0/4] Generic container system

From: Paul Menage
Date: Mon Oct 02 2006 - 06:39:26 EST


This is essentially the same as the patch set that I posted last week,
with the following fixes/changes:

- CONFIG_CONTAINERS is no longer a user-selectable option - subsystems
such as cpusets that require it should select it in Kconfig.

- Each container subsystem type now has a name, and a <name>_enabled
file in the top container directory. This file contains 0 or 1 to
indicate whether the container subsystem is enabled, and can only be
modified when there are no subcontainers; disabled container subsystems
don't get new instances created when a subcontainer is created; the
subsystem-specific state is simply inherited from the parent
container.

- include a config option to default to enabled, for backwards
compatibility

- Documentation tweaks

- builds properly without CONFIG_CONTAINER_CPUACCT configured on

- should build properly with newer gccs. (I've not actually had a
chance to try building it with anything newer than gcc 3.2.2, but I've
fixed all the potential warnings/errors that PaulJ pointed out when
compiling with some unspecified newer gcc).

I've also looked at converting ResGroups to be a client of the
container system. This isn't yet complete; my thoughts so far include:

- each resource controller can be implemented as an independent
container subsystem; rather than a single "shares" and "stats" file
in each directory there will be e.g. "numtasks_shares",
"cpurc_shares", etc

- the ResGroups core will basically end up as a library that provides
the common parsing/displaying for the shares and stats file for each
controller, and the logic for propagating resources up and down the
parent/child tree.

- for some of the resource controllers we will probably require a few
extra callbacks from the container system, e.g. at fork/exit time.
I might make these a config option that the controller must "select"
in Kconfig, to avoid extra locking/overhead for subsystems such as
cpusets that don't require such callbacks.

-------------------------------------

There have recently been various proposals floating around for
resource management/accounting subsystems in the kernel, including
Res Groups, User BeanCounters and others. These all need the basic
abstraction of being able to group together multiple processes in an
aggregate, in order to track/limit the resources permitted to those
processes, and all implement this grouping in different ways.

Already existing in the kernel is the cpuset subsystem; this has a
process grouping mechanism that is mature, tested, and well documented
(particularly with regards to synchronization rules).

This patchset extracts the process grouping code from cpusets into a
generic container system, and makes the cpusets code a client of
the container system.

It also provides a very simple additional container subsystem to do
per-container CPU usage accounting; this is primarily to demonstrate
use of the container subsystem API, but is useful in its own right.

The change is implemented in four stages:

1) extract the process grouping code from cpusets into a standalone system

2) remove the process grouping code from cpusets and hook into the
container system

3) convert the container system to present a generic API, and make
cpusets a client of that API

4) add a simple CPU accounting container subsystem as an example

The intention is that the various resource management efforts can also
become container clients, with the result that:

- the userspace APIs are (somewhat) normalised

- it's easier to test out e.g. the ResGroups CPU controller in
conjunction with the UBC memory controller

- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel

Possible TODOs include:

- define a convention for populating the per-container directories so
that different subsystems don't clash with one another

- provide higher-level primitives (e.g. an easy interface to seq_file)
for files registered by subsystems.

- support subsystem deregistering

Signed-off-by: Paul Menage <menage@xxxxxxxxxx>

---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/