Re: [PATCH] 0/2 Buddy allocator with placement policy (Version 9)+ prezeroing (Version 4)

From: Mel Gorman
Date: Mon Mar 14 2005 - 09:12:23 EST

On Thu, 10 Mar 2005, Dave Hansen wrote:

> On Thu, 2005-03-10 at 09:22 -0800, Paul Jackson wrote:
> > In particular, I am working on preparing a patch proposal for a policy
> > that would kill a task rather than invoke the swapper. In
> > mm/page_alloc.c __alloc_pages(), if one gets down to the point of being
> > about to kick the swapper, if this policy is enabled (and you're not
> > in_interrupt() and don't have flag PF_MEMALLOC set), then ask
> > oom_kill_task() to shoot us instead. For some big HPC jobs that are
> > carefully sized to fit on the allowed memory nodes, swapping is a fate
> > worse than death.
> >
> > The natural place (for me, anyway) to hang such policies is off the
> > cpuset.
> >

I have not read up on cpuset before so I am assuming you are talking about so correct me if I am wrong.

> > I am hopeful that cpusets will soon hit Linus's tree.
> >
> > Would it make sense to specify these buddy allocator fallback policies
> > per cpuset as well?
> That seems reasonable, but I don't think there necessarily be enough
> cpuset users to make this reasonable as the only interface.
> Seems like something VMA-based along the lines of madvise() or the NUMA
> binding API would be more appropriate. Perhaps default policies
> inherited from a cpuset, but overridden by other APIs would be a good
> compromise.

I would think that VMA is too fine-grained and there is not a really clean
way of specifying what fallback policy to use with madvise() unless we
hard-code a fixed number of policies. I agree that if cpuset is not
widely used, it should not be the only way of setting policy. However, the
NUMA binding API deals with memory ranges, not PIDs. Implementing the
fallbacks for memory ranges would be at the node or zone level, not PID
which could be a conflict with cpuset (for example, which takes priority
when both cpuset and node policies are set?). So, I don't think the numa
binding as it is today is exactly the way to go either.

First though, we have to define how a fallback policy would be implemented
and applied. Right now, there is one hard-coded policy (assuming that the
placement policy gets merged at some point in the future) that is
implemented in __rmqueue() and is applied at the zone level. It is
implemented as a while loop and the information it uses is;

Input: int[] Integer array of fallback allocation types
struct zone * zone is the zone being allocated from
int order being the required order

Output: struct free_area *area is the area we are going to allocate from
int current_order is the order of the free pages in area

So, a very rough fallback policy setup might look something like;

* Fallback function fills this struct telling the allocator where to get
* free pages from
struct fallback {
struct free_area *area;
unsigned long current_order;

/* struct that describes a fallback policy */
struct fallback_policy {
void (*fallback)(int *fallback_list, struct zone* zone, int order);
char name[10];
int id; /* set on register */

/* List of all available fallback policies */
struct list_head *fallback_policies;

/* Default policy to use */
struct fallback_policy default_policy = {
.fallback = fallback_lowfrag,
.name = "default",
.id = 0

* Register a new fallback policy. Throws a wobbly if the name is already
* registered
void register_fallback(struct fallback_policy *policy);

/* Unregister a fallback, calls BUG if policy == default_policy */
void unregister_fallback(struct fallback_policy *policy);

the fallback_lowfrag() function would just implement the current fallback
approach. Other users like hotplug or cpuset would create their own
fallback policy, populate a fallback_policy and register it. I think this
is similar to how IO schedulers are registered.

Next... How do we apply it.

I think the sensible level to have a fallback policy as at the mm_struct
level. If the mm does not have a pointer to a fallback_policy, it uses
the default. This would be inherited from either a cpuset or explicitly
set via a specific API (should it inherit across exec()? Initially, I
would think no)

Case 1: The cpuset case

cpuset has a pseudo filesystem called cfs that looks very like /sys
mounted on /dev/cpuset. Each directory under /dev/cpuset is a cpuset and
there are a number of files there. For fallbacks, a new file would be
there called fallback. Catting it would print something like

[default] hotplug noswapper

where the name between [] is what is currently set. To set a new fallback
policy, echo the new string name to fallback like

echo noswapper > fallback

Any process created with pexec() or is otherwise part of a cpuset will get
it's policy from here.

Case 2: Specific API

The NUMA binding API has a method that looks like this;

#include <linux/mempolicy.h>
int set_mempolicy(int policy, unsigned long *nodemask, unsigned long

We could have something like

int set_mempolicy_fallback(int pid, char *policy_name);

Straight off, I don't like that policy_name is a string. Maybe, we would
export a mapping of ID to string names via /sys although it means users of
the API would have to parse /sys first which is undesirable.

Case 3: Dirty hack via /proc

We could have a proc entry like /proc/pid/fallback which behaves the same
as the cfs entry. I have a funny feeling that no one will go for this for
anything other than a proof of concept.

> I have the feeling that applications will want to give the same kind of
> notifications for swapping as they would for memory hotplug operations
> as well. In decreasing order of pickiness:
> 1. Don't touch me at all
> 2. It's OK to migrate these pages elsewhere on this node
> 3. It's OK to migrate these pages anywhere
> 4. It's OK to swap these pages out

I think this is a separate issue because it affects page replacement both
locally and globally as well as allocation fallback policy.

Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at