Re: [PATCH] 0/2 Buddy allocator with placement policy (Version 9) +prezeroing (Version 4)

From: Paul Jackson
Date: Mon Mar 14 2005 - 14:14:34 EST

Mel wrote:
> I have not read up on cpuset before so I am assuming you are talking about
> so correct me if I am wrong.

Yes - that. See also the kernel doc file:


> I agree that if cpuset is not
> widely used, it should not be the only way of setting policy.

Well ... cpusets just hit Linus's tree four days ago, so I wouldn't expect
them to have achieved world domination quite yet ;).

Cpusets implement hardwall outer limits on cpu and memory usage. The
tasks assigned to a cpuset are only allowed to work within that cpuset.

Within a cpuset, a job may use sched_setaffinity, set_mempolicy, mbind
and madvise to make fine grained placement and related policy choices
however it chooses, subject to the broad, hard constraints of the cpuset.
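
To illustrate the layering described above, here is a toy model, not kernel
code. It assumes one bit per cpu (or memory node) and a hypothetical helper
named effective_mask(); the point is only that a fine grained request from
inside the job can narrow placement, but never widen it past the cpuset.

```c
#include <stdint.h>

/*
 * Toy model (assumption): one bit per cpu or memory node.
 * effective_mask() is a hypothetical helper, not a kernel function.
 *
 * 'requested' stands for what a task asks for via a call such as
 * sched_setaffinity or set_mempolicy; 'cpuset_allowed' stands for
 * the hardwall limit of the task's cpuset.
 */
static uint64_t effective_mask(uint64_t requested, uint64_t cpuset_allowed)
{
    /* Anything outside the cpuset is dropped from the request. */
    return requested & cpuset_allowed;
}
```

In this model, a request that lies entirely outside the cpuset yields an
empty mask; how a real kernel reports that case is not something this sketch
tries to capture.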

The imposition of a policy that says a task can't swap is usually, at
least where I see it used, a hard constraint, imposed externally on an
entire job, for the well being of the rest of the system:

Waking up swappers imposes a burden on the rest of the
system, which some jobs must not be allowed to do.

And wasting further cpu cycles on a job that has exceeded
its allowed memory when it wasn't supposed to (and hence
no longer has any chance of sustaining the in-memory
performance required of it) is a waste of possibly expensive
compute resources.

The natural place for such an externally imposed policy limiting overall
processor or memory usage by a group of tasks is, in my admittedly
biased view, the cpuset.

I envision a per-cpuset file, "policy_kill_no_swap", containing a
boolean "0" or "1" (actually, "0\n" or "1\n"). It defaults to "0". If
set to "1" (by writing "1" to that file) then if any task in that cpuset
gets far enough in the mm/page_alloc.c:__alloc_pages() code to initiate
swapping, that task is killed instead.
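
A minimal user-space sketch of that proposal, under stated assumptions: the
names parse_kill_no_swap(), on_swap_needed() and the alloc_action enum are
hypothetical, and nothing here is taken from an actual patch. It shows the
"0\n" / "1\n" file format and the decision made at the point where the
allocator would otherwise initiate swapping.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical parser for the contents of a per-cpuset
 * "policy_kill_no_swap" file. Accepts "0", "1", "0\n" or "1\n".
 * Returns 0 on success and stores the value in *flag,
 * or -1 if the buffer is not a valid boolean.
 */
static int parse_kill_no_swap(const char *buf, bool *flag)
{
    size_t len = strlen(buf);

    /* Tolerate a single trailing newline, as written by echo. */
    if (len > 0 && buf[len - 1] == '\n')
        len--;
    if (len != 1)
        return -1;

    if (buf[0] == '0')
        *flag = false;
    else if (buf[0] == '1')
        *flag = true;
    else
        return -1;
    return 0;
}

/* What to do when an allocation would have to initiate swapping. */
enum alloc_action { ACTION_SWAP, ACTION_KILL_TASK };

/* Hypothetical decision point: kill the task if its cpuset says so. */
static enum alloc_action on_swap_needed(bool cpuset_kill_no_swap)
{
    return cpuset_kill_no_swap ? ACTION_KILL_TASK : ACTION_SWAP;
}
```

With the default of "0", on_swap_needed() returns ACTION_SWAP and behavior
is unchanged; only a cpuset whose file holds "1" gets its tasks killed
instead.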

I don't see any need for a second way of specifying this policy
preference, such as a per-task call like set_mempolicy(2). However, if
others saw such a need, I'm open to considering it.

I don't view this fallback stuff like I see Mel describing it. I don't
see it as passing a list of fallback alternatives to a single API.
Rather, each API need only specify one policy. The only place
'fallback' comes into play is if there are multiple APIs (such as both
set_mempolicy and cpusets) that affect the same decision in the kernel
(such as whether to let a task invoke swapping, or to kill it instead).
The 'fallback' is the choice of which API takes precedence here. For
system wide imposed hardwall limitations, the cpuset should have its
policy enforced. Within those limits, finer grained calls such as
set_mempolicy should prevail.

So, if others did make the case for a second, per-task, way of
specifying this 'kill_no_swap' policy, then:
1) If a task's cpuset policy_kill_no_swap is true, that prevails.
2) Otherwise the per-task setting of kill_no_swap prevails.
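
The two rules above can be sketched as a single function. This is only an
illustration: kill_instead_of_swap() and its two parameters are hypothetical
names, and the per-task setting exists only if someone makes the case for it.

```c
#include <stdbool.h>

/*
 * Hypothetical precedence rule for the 'kill_no_swap' policy.
 *
 * cpuset_flag: value of the task's cpuset policy_kill_no_swap.
 * task_flag:   value of a (so far only hypothesized) per-task setting.
 *
 * Returns true if the task should be killed rather than allowed
 * to initiate swapping.
 */
static bool kill_instead_of_swap(bool cpuset_flag, bool task_flag)
{
    /* Rule 1: a true cpuset setting is a hardwall and always wins. */
    if (cpuset_flag)
        return true;

    /* Rule 2: otherwise the per-task setting decides. */
    return task_flag;
}
```

Note that a task can opt in to being killed on its own, but it cannot opt
out once its cpuset has the policy set.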

The choice of where migration is allowed is separate, in my view,
and deserves its own policy flags. I don't know what those flags
should be.

I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxxxxxxx> 1.650.933.1373, 1.925.600.0401