Re: [patch] oom: give current access to memory reserves if it hasbeen killed

From: Oleg Nesterov
Date: Thu Apr 01 2010 - 12:34:17 EST


On 04/01, David Rientjes wrote:
>
> On Thu, 1 Apr 2010, Oleg Nesterov wrote:
>
> > Why? You ignored this part:
> >
> > Say, right after exit_mm() we are doing acct_process(), and f_op->write()
> > needs a page. So, you are saying that in this case __page_cache_alloc()
> > can never trigger out_of_memory() ?
> >
> > why this is not possible?
> >
>
> It can, but the check for p->mm is sufficient since exit_notify()

Yes, but I meant out_of_memory()->__oom_kill_task(current). OK, we
already discussed this in the previous emails.

> We cannot rely on oom_badness() to filter this task because we still
> select it as our chosen task even with a badness score of 0 if !chosen

Yes, see another email from me.

> Your point about p->mm being non-NULL for kthreads using use_mm() is
> taken, we should probably just change the is_global_init() check in
> select_bad_process() to p->flags & PF_KTHREAD and ensure we reject
> oom_kill_process() for them.

Yes, but we have to check both is_global_init() and PF_KTHREAD.

The "patch" I sent checks PF_KTHREAD in find_lock_task_mm(), but as I
said select_bad_process() is the better place.

> > OK, a bad user does
> >
> > int sleep_forever(void *)
> > {
> > pause();
> > }
> >
> > int main(void)
> > {
> > pthread_create(sleep_forever);
> > syscall(__NR_exit);
> > }
> >
> > Now, every time select_bad_process() is called it will find this process
> > and PF_EXITING is true, so it just returns ERR_PTR(-1UL). And note that
> > this process is not going to exit.
> >
>
> Hmm, so it looks like we need to filter on !p->mm before checking for
> PF_EXITING so that tasks that are EXIT_ZOMBIE won't make the oom killer
> into a no-op.

As it was already discussed, it is not easy to check !p->mm. Once
again, we must not filter out the task just because its ->mm == NULL.

Probably the best change for now is

- if (p->flags & PF_EXITING) {
+ if (p->flags & PF_EXITING && p->mm) {

This is not perfect too, but much better.

> > > > Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> > > > Again, this is not right even if we forget about !child->mm check.
> > > > This list_for_each_entry() can only see the processes forked by the
> > > > main thread.
> > > >
> > >
> > > That's the intention.
> >
> > Why? shouldn't oom_badness() return the same result for any thread
> > in thread group? We should take all childs into account.
> >
>
> oom_forkbomb_penalty() only cares about first-descendant children that
> do not share the same memory,

I see, but the code doesn't really do this. I mean, it doesn't really
see the first-descendant children, only those which were forked by the
main thread.

Look. We have a main thread M and the sub-thread T. T forks a lot of
processes which use a lot of memory. These processes _are_ the first
descendant children of the M+T thread group, they should be accounted.
But M->children list is empty.

oom_forkbomb_penalty() and oom_kill_process() should do

t = tsk;
do {
list_for_each_entry(child, &t->children, sibling) {
... take child into account ...
}
} while_each_thread(tsk, t);


> > > > Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under
> > > > task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss().
> > > >
> [...snip...]
> We need task_lock() to ensure child->mm hasn't detached between the check
> for child->mm == tsk->mm and get_mm_rss(child->mm). So I'm not sure what
> you're trying to improve with this variation, it's a tradeoff between
> calling thread_group_cputime() under task_lock() for a subset of a task's
> threads when we already need to hold task_lock() anyway vs. calling it for
> all threads unconditionally.

See the patch below. Yes, this is minor, but it is always good to avoid
the unnecessary locks, and thread_group_cputime() is O(N).

Not only for performance reasons. This allows to change the locking in
thread_group_cputime() if needed without fear to deadlock with task_lock().

Oleg.

--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -97,13 +97,16 @@ static unsigned long oom_forkbomb_penalt
return 0;
list_for_each_entry(child, &tsk->children, sibling) {
struct task_cputime task_time;
- unsigned long runtime;
+ unsigned long runtime, this_rss;

task_lock(child);
if (!child->mm || child->mm == tsk->mm) {
task_unlock(child);
continue;
}
+ this_rss = get_mm_rss(child->mm);
+ task_unlock(child);
+
thread_group_cputime(child, &task_time);
runtime = cputime_to_jiffies(task_time.utime) +
cputime_to_jiffies(task_time.stime);
@@ -113,10 +116,9 @@ static unsigned long oom_forkbomb_penalt
* get to execute at all in such cases anyway.
*/
if (runtime < HZ) {
- child_rss += get_mm_rss(child->mm);
+ child_rss += this_rss;
forkcount++;
}
- task_unlock(child);
}

/*

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/