Re: [PATCH -v2] cpuset: various documentation fixes and updates
From: Randy Dunlap
Date: Thu Feb 19 2009 - 11:23:32 EST
Li Zefan wrote:
> I noticed the old commit 8f5aa26c75b7722e80c0c5c5bb833d41865d7019
> ("cpusets: update_cpumask documentation fix") is not a complete fix,
> resulting in inconsistent paragraphs. This patch fixes it and does
> other fixes and updates:
>
> - s/migrate_all_tasks()/migrate_live_tasks()/
> - describe more cpuset control files
> - s/cpumask_t/struct cpumask/
> - document that cpu hotplug and changes to 'sched_relax_domain_level' may
> cause a domain rebuild
> - document various ways to query and modify cpusets
> - the equivalent of "mount -t cpuset" is "mount -t cgroup -o cpuset,noprefix"
>
> Signed-off-by: Li Zefan <lizf@xxxxxxxxxxxxxx>
Acked-by: Randy Dunlap <randy.dunlap@xxxxxxxxxx>

Andrew, who should merge this?
> ---
>
> v1 -> v2: fixed 2 typos pointed out by Randy.
>
> ---
> Documentation/cgroups/cpusets.txt | 65 +++++++++++++++++++++----------------
> 1 files changed, 37 insertions(+), 28 deletions(-)
>
> diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
> index 5c86c25..0611e95 100644
> --- a/Documentation/cgroups/cpusets.txt
> +++ b/Documentation/cgroups/cpusets.txt
> @@ -142,7 +142,7 @@ into the rest of the kernel, none in performance critical paths:
> - in fork and exit, to attach and detach a task from its cpuset.
> - in sched_setaffinity, to mask the requested CPUs by what's
> allowed in that tasks cpuset.
> - - in sched.c migrate_all_tasks(), to keep migrating tasks within
> + - in sched.c migrate_live_tasks(), to keep migrating tasks within
> the CPUs allowed by their cpuset, if possible.
> - in the mbind and set_mempolicy system calls, to mask the requested
> Memory Nodes by what's allowed in that tasks cpuset.
> @@ -175,6 +175,10 @@ files describing that cpuset:
> - mem_exclusive flag: is memory placement exclusive?
> - mem_hardwall flag: is memory allocation hardwalled
> - memory_pressure: measure of how much paging pressure in cpuset
> + - memory_spread_page flag: if set, spread page cache evenly on allowed nodes
> + - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
> + - sched_load_balance flag: if set, load balance within CPUs on that cpuset
> + - sched_relax_domain_level: the searching range when migrating tasks
>
> In addition, the root cpuset only has the following file:
> - memory_pressure_enabled flag: compute memory_pressure?
> @@ -252,7 +256,7 @@ is causing.
>
> This is useful both on tightly managed systems running a wide mix of
> submitted jobs, which may choose to terminate or re-prioritize jobs that
> -are trying to use more memory than allowed on the nodes assigned them,
> +are trying to use more memory than allowed on the nodes assigned to them,
> and with tightly coupled, long running, massively parallel scientific
> computing jobs that will dramatically fail to meet required performance
> goals if they start to use more memory than allowed to them.
> @@ -378,7 +382,7 @@ as cpusets and sched_setaffinity.
> The algorithmic cost of load balancing and its impact on key shared
> kernel data structures such as the task list increases more than
> linearly with the number of CPUs being balanced. So the scheduler
> -has support to partition the systems CPUs into a number of sched
> +has support to partition the systems CPUs into a number of sched
> domains such that it only load balances within each sched domain.
> Each sched domain covers some subset of the CPUs in the system;
> no two sched domains overlap; some CPUs might not be in any sched
> @@ -485,17 +489,22 @@ of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
> The internal kernel cpuset to scheduler interface passes from the
> cpuset code to the scheduler code a partition of the load balanced
> CPUs in the system. This partition is a set of subsets (represented
> -as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
> -the CPUs that must be load balanced.
> -
> -Whenever the 'sched_load_balance' flag changes, or CPUs come or go
> -from a cpuset with this flag enabled, or a cpuset with this flag
> -enabled is removed, the cpuset code builds a new such partition and
> -passes it to the scheduler sched domain setup code, to have the sched
> -domains rebuilt as necessary.
> +as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
> +all the CPUs that must be load balanced.
> +
> +The cpuset code builds a new such partition and passes it to the
> +scheduler sched domain setup code, to have the sched domains rebuilt
> +as necessary, whenever:
> + - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes,
> + - or CPUs come or go from a cpuset with this flag enabled,
> + - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs
> + and with this flag enabled changes,
> + - or a cpuset with non-empty CPUs and with this flag enabled is removed,
> + - or a cpu is offlined/onlined.
>
> This partition exactly defines what sched domains the scheduler should
> -setup - one sched domain for each element (cpumask_t) in the partition.
> +setup - one sched domain for each element (struct cpumask) in the
> +partition.
>
> The scheduler remembers the currently active sched domain partitions.
> When the scheduler routine partition_sched_domains() is invoked from
> @@ -559,7 +568,7 @@ domain, the largest value among those is used. Be careful, if one
> requests 0 and others are -1 then 0 is used.
>
> Note that modifying this file will have both good and bad effects,
> -and whether it is acceptable or not will be depend on your situation.
> +and whether it is acceptable or not depends on your situation.
> Don't modify this file if you are not sure.
>
> If your situation is:
> @@ -600,19 +609,15 @@ to allocate a page of memory for that task.
>
> If a cpuset has its 'cpus' modified, then each task in that cpuset
> will have its allowed CPU placement changed immediately. Similarly,
> -if a tasks pid is written to a cpusets 'tasks' file, in either its
> -current cpuset or another cpuset, then its allowed CPU placement is
> -changed immediately. If such a task had been bound to some subset
> -of its cpuset using the sched_setaffinity() call, the task will be
> -allowed to run on any CPU allowed in its new cpuset, negating the
> -affect of the prior sched_setaffinity() call.
> +if a tasks pid is written to another cpusets 'tasks' file, then its
> +allowed CPU placement is changed immediately. If such a task had been
> +bound to some subset of its cpuset using the sched_setaffinity() call,
> +the task will be allowed to run on any CPU allowed in its new cpuset,
> +negating the effect of the prior sched_setaffinity() call.
>
> In summary, the memory placement of a task whose cpuset is changed is
> updated by the kernel, on the next allocation of a page for that task,
> -but the processor placement is not updated, until that tasks pid is
> -rewritten to the 'tasks' file of its cpuset. This is done to avoid
> -impacting the scheduler code in the kernel with a check for changes
> -in a tasks processor placement.
> +and the processor placement is updated immediately.
>
> Normally, once a page is allocated (given a physical page
> of main memory) then that page stays on whatever node it
> @@ -681,10 +686,14 @@ and then start a subshell 'sh' in that cpuset:
> # The next line should display '/Charlie'
> cat /proc/self/cpuset
>
> -In the future, a C library interface to cpusets will likely be
> -available. For now, the only way to query or modify cpusets is
> -via the cpuset file system, using the various cd, mkdir, echo, cat,
> -rmdir commands from the shell, or their equivalent from C.
> +There are ways to query or modify cpusets:
> + - via the cpuset file system directly, using the various cd, mkdir, echo,
> + cat, rmdir commands from the shell, or their equivalent from C.
> + - via the C library libcpuset.
> + - via the C library libcgroup.
> (http://sourceforge.net/projects/libcg/)
> + - via the python application cset.
> + (http://developer.novell.com/wiki/index.php/Cpuset)
>
> The sched_setaffinity calls can also be done at the shell prompt using
> SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
> @@ -756,7 +765,7 @@ mount -t cpuset X /dev/cpuset
>
> is equivalent to
>
> -mount -t cgroup -ocpuset X /dev/cpuset
> +mount -t cgroup -ocpuset,noprefix X /dev/cpuset
> echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
>
> 2.2 Adding/removing cpus
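
For readers following the partition wording in the hunk at line 485, the
property the text describes (subsets of CPUs that are pairwise disjoint and
together cover every CPU that must be load balanced) can be checked with a
small userspace sketch. This is only an illustration, not kernel code: the
helper names parse_cpu_list() and is_valid_partition() are hypothetical, and
the "0-3,8" list syntax is assumed to match what a cpuset 'cpus' file holds.

```python
from itertools import combinations


def parse_cpu_list(text):
    """Parse a cpuset-style CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty list or a trailing comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def is_valid_partition(subsets, balanced_cpus):
    """Return True if subsets are pairwise disjoint and cover balanced_cpus.

    Hypothetical helper: each element of subsets stands in for one
    struct cpumask in the array handed to the scheduler.
    """
    for a, b in combinations(subsets, 2):
        if a & b:
            return False  # two would-be sched domains overlap
    covered = set().union(*subsets)
    return covered == set(balanced_cpus)
```

For example, is_valid_partition([parse_cpu_list("0-1"), parse_cpu_list("2-3")],
parse_cpu_list("0-3")) holds, while two subsets that share a CPU, or a set of
subsets that leaves a balanced CPU uncovered, are rejected.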
--
~Randy