Re: Improving OOM killer

From: David Rientjes
Date: Wed Feb 03 2010 - 13:58:17 EST


On Wed, 3 Feb 2010, Balbir Singh wrote:

> > IIRC the child accumulating code was introduced to deal with
> > malicious code (fork bombs), but it makes things worse for the
> > (much more common) situation of a system without malicious
> > code simply running out of memory due to being very busy.
> >
>
> For fork bombs, we could do a number of children number test and have
> a threshold before we consider a process and its children for
> badness().
>

Yes, we could count the children with separate mm's and then penalize
those threads that have forked an egregious number, say, 500 tasks. I
think we should check for this threshold within the badness() heuristic
to identify such forkbombs and not limit it only to certain
applications.

My rewrite for the badness() heuristic is centered on the idea that scores
should range from 0 to 1000, 0 meaning "never kill this task" and 1000
meaning "kill this task first." The baseline for a thread, p, may be
something like this:

	unsigned int badness(struct task_struct *p,
					unsigned long totalram)
	{
		struct task_struct *child;
		struct mm_struct *mm;
		int forkcount = 0;
		long points;

		task_lock(p);
		mm = p->mm;
		if (!mm) {
			task_unlock(p);
			return 0;
		}
		points = (get_mm_rss(mm) +
				get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
				totalram;
		task_unlock(p);

		list_for_each_entry(child, &p->children, sibling) {
			/* No lock, child->mm won't be dereferenced */
			if (child->mm && child->mm != mm)
				forkcount++;
		}

		/* Forkbombs get penalized 10% of available RAM */
		if (forkcount > 500)
			points += 100;

		...

		/*
		 * /proc/pid/oom_adj ranges from -1000 to +1000 to either
		 * completely disable oom killing or always prefer it.
		 */
		points += p->signal->oom_adj;

		if (points < 0)
			return 0;
		return (points <= 1000) ? points : 1000;
	}

	static struct task_struct *select_bad_process(...,
						nodemask_t *nodemask)
	{
		struct task_struct *p;
		unsigned long totalram = 0;
		int nid;

		/* Sum the RAM on the nodes this allocation may use */
		for_each_node_mask(nid, *nodemask)
			totalram += NODE_DATA(nid)->node_present_pages;

		for_each_process(p) {
			unsigned int points;

			...

			/* Skip tasks that cannot free memory on these nodes */
			if (!nodes_intersects(p->mems_allowed, *nodemask))
				continue;

			...
			points = badness(p, totalram);
			...
		}
		...
	}

In this example, /proc/pid/oom_adj now ranges from -1000 to +1000, with
OOM_DISABLE being -1000, to polarize tasks for oom killing or determine
when a task is leaking memory because it is using far more memory than it
should. The nodemask passed from the page allocator should be intersected
with current->mems_allowed within the oom killer; userspace is then fully
aware of what value is an egregious amount of RAM for a task to consume,
including information it knows about the task's cpuset or mempolicy. For
example, it would be very simple for a user to set an oom_adj of -500,
which means "we discount 50% of the task's allowed memory from being
considered in the heuristic" or +500, which means "we always allow all
other system/cpuset/mempolicy tasks to use at least 50% more allowed
memory than this one."
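
To make the arithmetic concrete, here is a minimal userspace sketch of
the scoring proposed above. It is not kernel code: model_badness() and
its parameters are hypothetical names used only for illustration, and it
models just the steps shown in the example (memory share scaled to 1000,
the forkbomb penalty, oom_adj, and the final clamp), leaving out the
elided parts.

```c
/*
 * Hypothetical userspace model of the proposed badness() score.
 * All names here are illustrative assumptions, not kernel interfaces.
 */
unsigned int model_badness(unsigned long rss_pages,
			   unsigned long swap_pages,
			   unsigned long totalram_pages,
			   int forkcount, int oom_adj)
{
	long points;

	/* Guard added for the model only; avoids dividing by zero */
	if (totalram_pages == 0)
		return 0;

	/* Share of allowed RAM (plus swap) in use, scaled to 0..1000 */
	points = (long)((rss_pages + swap_pages) * 1000 / totalram_pages);

	/* Forkbomb penalty: 100 points, i.e. 10% of allowed RAM */
	if (forkcount > 500)
		points += 100;

	/* oom_adj ranges from -1000 to +1000 in this proposal */
	points += oom_adj;

	/* Clamp to the 0..1000 range */
	if (points < 0)
		return 0;
	return (points <= 1000) ? (unsigned int)points : 1000;
}
```

With 1000000 allowed pages, a task using 600000 pages scores 600 under
this model. An oom_adj of -500 brings that down to 100, +500 clamps it
at 1000, and -1000 yields 0 even for a task using all of the allowed
memory.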