[PATCH -v2] cpuset: various documentation fixes and updates

From: Li Zefan
Date: Thu Feb 19 2009 - 00:58:49 EST


I noticed the old commit 8f5aa26c75b7722e80c0c5c5bb833d41865d7019
("cpusets: update_cpumask documentation fix") is not a complete fix,
resulting in inconsistent paragraphs. This patch fixes it and does
other fixes and updates:

- s/migrate_all_tasks()/migrate_live_tasks()/
- describe more cpuset control files
- s/cpumask_t/struct cpumask/
- document cpu hotplug and change of 'sched_relax_domain_level' may cause
domain rebuild
- document various ways to query and modify cpusets
- the equivalent of "mount -t cpuset" is "mount -t cgroup -o cpuset,noprefix"

Signed-off-by: Li Zefan <lizf@xxxxxxxxxxxxxx>
---

v1 -> v2: fixed 2 typos pointed out by Randy.

---
Documentation/cgroups/cpusets.txt | 65 +++++++++++++++++++++----------------
1 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 5c86c25..0611e95 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -142,7 +142,7 @@ into the rest of the kernel, none in performance critical paths:
- in fork and exit, to attach and detach a task from its cpuset.
- in sched_setaffinity, to mask the requested CPUs by what's
allowed in that tasks cpuset.
- - in sched.c migrate_all_tasks(), to keep migrating tasks within
+ - in sched.c migrate_live_tasks(), to keep migrating tasks within
the CPUs allowed by their cpuset, if possible.
- in the mbind and set_mempolicy system calls, to mask the requested
Memory Nodes by what's allowed in that tasks cpuset.
@@ -175,6 +175,10 @@ files describing that cpuset:
- mem_exclusive flag: is memory placement exclusive?
- mem_hardwall flag: is memory allocation hardwalled
- memory_pressure: measure of how much paging pressure in cpuset
+ - memory_spread_page flag: if set, spread page cache evenly on allowed nodes
+ - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - sched_load_balance flag: if set, load balance within CPUs on that cpuset
+ - sched_relax_domain_level: the searching range when migrating tasks

In addition, the root cpuset only has the following file:
- memory_pressure_enabled flag: compute memory_pressure?
@@ -252,7 +256,7 @@ is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
-are trying to use more memory than allowed on the nodes assigned them,
+are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.
@@ -378,7 +382,7 @@ as cpusets and sched_setaffinity.
The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced. So the scheduler
-has support to partition the systems CPUs into a number of sched
+has support to partition the systems CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
@@ -485,17 +489,22 @@ of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system. This partition is a set of subsets (represented
-as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
-the CPUs that must be load balanced.
-
-Whenever the 'sched_load_balance' flag changes, or CPUs come or go
-from a cpuset with this flag enabled, or a cpuset with this flag
-enabled is removed, the cpuset code builds a new such partition and
-passes it to the scheduler sched domain setup code, to have the sched
-domains rebuilt as necessary.
+as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
+all the CPUs that must be load balanced.
+
+The cpuset code builds a new such partition and passes it to the
+scheduler sched domain setup code, to have the sched domains rebuilt
+as necessary, whenever:
+ - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes,
+ - or CPUs come or go from a cpuset with this flag enabled,
+ - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs
+ and with this flag enabled changes,
+ - or a cpuset with non-empty CPUs and with this flag enabled is removed,
+ - or a cpu is offlined/onlined.

This partition exactly defines what sched domains the scheduler should
-setup - one sched domain for each element (cpumask_t) in the partition.
+setup - one sched domain for each element (struct cpumask) in the
+partition.

The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
@@ -559,7 +568,7 @@ domain, the largest value among those is used. Be careful, if one
requests 0 and others are -1 then 0 is used.

Note that modifying this file will have both good and bad effects,
-and whether it is acceptable or not will be depend on your situation.
+and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.

If your situation is:
@@ -600,19 +609,15 @@ to allocate a page of memory for that task.

If a cpuset has its 'cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately. Similarly,
-if a tasks pid is written to a cpusets 'tasks' file, in either its
-current cpuset or another cpuset, then its allowed CPU placement is
-changed immediately. If such a task had been bound to some subset
-of its cpuset using the sched_setaffinity() call, the task will be
-allowed to run on any CPU allowed in its new cpuset, negating the
-affect of the prior sched_setaffinity() call.
+if a tasks pid is written to another cpusets 'tasks' file, then its
+allowed CPU placement is changed immediately. If such a task had been
+bound to some subset of its cpuset using the sched_setaffinity() call,
+the task will be allowed to run on any CPU allowed in its new cpuset,
+negating the effect of the prior sched_setaffinity() call.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
-but the processor placement is not updated, until that tasks pid is
-rewritten to the 'tasks' file of its cpuset. This is done to avoid
-impacting the scheduler code in the kernel with a check for changes
-in a tasks processor placement.
+and the processor placement is updated immediately.

Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
@@ -681,10 +686,14 @@ and then start a subshell 'sh' in that cpuset:
# The next line should display '/Charlie'
cat /proc/self/cpuset

-In the future, a C library interface to cpusets will likely be
-available. For now, the only way to query or modify cpusets is
-via the cpuset file system, using the various cd, mkdir, echo, cat,
-rmdir commands from the shell, or their equivalent from C.
+There are ways to query or modify cpusets:
+ - via the cpuset file system directly, using the various cd, mkdir, echo,
+ cat, rmdir commands from the shell, or their equivalent from C.
+ - via the C library libcpuset.
+ - via the C library libcgroup.
+ (http://sourceforge.net/proects/libcg/)
+ - via the python application cset.
+ (http://developer.novell.com/wiki/index.php/Cpuset)

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
@@ -756,7 +765,7 @@ mount -t cpuset X /dev/cpuset

is equivalent to

-mount -t cgroup -ocpuset X /dev/cpuset
+mount -t cgroup -ocpuset,noprefix X /dev/cpuset
echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent

2.2 Adding/removing cpus
--
1.5.4.rc3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/