Re: [PATCH] mm: avoid unnecessary OOM kills

From: Dave Peterson
Date: Thu May 18 2006 - 22:39:25 EST

At 01:05 PM 5/18/2006, Andrew Morton wrote:
>> /* linux/mm/oom_kill.c */
>> +extern volatile unsigned long oom_kill_in_progress;
>This shouldn't be volatile.

Yes, I'll fix that.

>> +/*
>> + * Attempt to start an OOM kill operation. Return 0 on success, or 1 if an
>> + * OOM kill is already in progress.
>> + */
>> +static inline int oom_kill_start(void)
>> +{
>> + return test_and_set_bit(0, &oom_kill_in_progress);
>> +}
>Suggest this be called oom_kill_trystart().

Sounds good. I'll change the name.

>> +volatile unsigned long oom_kill_in_progress = 0;
>This shouldn't be initialised to zero. The kernel zeroes bss at startup.

Ok, I'll fix this.

>> /**
>> * badness - calculate a numeric value for how bad this task has been
>> * @p: task struct of which task we should calculate
>> @@ -260,27 +262,31 @@
>> struct mm_struct *mm;
>> task_t * g, * q;
>> + task_lock(p);
>> mm = p->mm;
>> - /* WARNING: mm may not be dereferenced since we did not obtain its
>> - * value from get_task_mm(p). This is OK since all we need to do is
>> - * compare mm to q->mm below.
>> + if (mm == NULL || mm == &init_mm) {
>> + task_unlock(p);
>> + return 1;
>> + }
>> +
>> + set_bit(MM_FLAG_OOM_NOTIFY, &mm->flags);
>> + task_unlock(p);
>Putting task_lock() in here would be a fairly obvious way to address the
>fragility which that comment describes. But I have a feeling that we
>discussed that a couple of weeks ago and found a problem with it, didn't we?

I think task_lock() works fine. That's what I have above.
When I generated the patch, it looks like I omitted the "-p" flag
to diff. Thus it looks like the code changes above are in badness()
when they are actually in oom_kill_task(). My apologies (I know this
makes things hard to read). I'll fix this when I repost.

>So if process A did a successful oom_kill_start(), and processes B, C, D
>and E all get into difficulty as well, they'll go into busywait loops. The
>cond_resched() will eventually save us from locking up, but it's a bit

Hmm... that's a good point. I kind of assumed the other processes
would go to sleep somewhere inside one of the calls to
get_page_from_freelist(), but it looks like this is not the case.
Without my changes, all of the processes would call out_of_memory()
and then loop back around to "restart". However there's a call to
schedule_timeout_interruptible() in out_of_memory() that processes B,
C, D, and E will circumvent if my changes are applied. I'll try to
work out a solution to this and then repost.

On somewhat of a tangent, it seems less than ideal for the allocator
to loop when memory is scarce (even if we may sleep on each
iteration). An alternative might be something like this: When a task
calls __alloc_pages(), it goes through the zonelists looking for pages
to allocate. If this fails, the task adds itself to a queue and goes
to sleep. Then when kswapd() or someone else frees some pages, the
pages are given to queued tasks, which are then awakened with their
requests satisfied. The memory allocator might optionally choose to
awaken a task without satisfying its request. Then __alloc_pages()
would return NULL for that task.

I haven't thought through the details of implementing something like
this in Linux. Perhaps there are issues I'm not aware of that would
make the above approach impractical. However I'd be curious to hear
your thoughts.

>And I'm trying to work out where process A will run oom_kill_finish() if
>`cancel' ends up being 0 in out_of_memory(). afaict, it doesn't, and the
>oom-killer will be accidentally disabled?

In this case, mmput() will call oom_kill_finish() once the mm_users
count on the mm_struct of the OOM killer victim has reached zero and
the mm_struct has been cleaned out. Thus additional OOM kills are
prevented until the victim has freed its memory. Again, this is a bit
unclear because I messed up generating the patch. Sorry about that.
Will fix shortly...

I should probably also do a better job of adding comments along with
my code changes.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at