Re: [PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj

From: Michal Hocko
Date: Mon May 30 2016 - 05:39:58 EST


On Mon 30-05-16 11:47:53, Vladimir Davydov wrote:
> On Mon, May 30, 2016 at 09:07:05AM +0200, Michal Hocko wrote:
> > On Fri 27-05-16 19:18:21, Vladimir Davydov wrote:
> > > On Fri, May 27, 2016 at 01:18:03PM +0200, Michal Hocko wrote:
> > > ...
> > > > @@ -1087,7 +1105,25 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> > > > unlock_task_sighand(task, &flags);
> > > > err_put_task:
> > > > put_task_struct(task);
> > > > +
> > > > + if (mm) {
> > > > + struct task_struct *p;
> > > > +
> > > > + rcu_read_lock();
> > > > + for_each_process(p) {
> > > > + task_lock(p);
> > > > + if (!p->vfork_done && process_shares_mm(p, mm)) {
> > > > + p->signal->oom_score_adj = oom_adj;
> > > > + if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
> > > > + p->signal->oom_score_adj_min = (short)oom_adj;
> > > > + }
> > > > + task_unlock(p);
> > >
> > > I.e. you write to /proc/pid1/oom_score_adj and get
> > > /proc/pid2/oom_score_adj updated if pid1 and pid2 share mm?
> > > IMO that looks unexpected from userspace pov.
> >
> > How much different it is from threads in the same thread group?
> > Processes sharing the mm without signals is a rather weird threading
> > model isn't it?
>
> I think so too. I wouldn't be surprised if it turned out that nobody had
> ever used it. But may be there's someone out there who does.

I have heard some rumors about users. But I haven't heard anything about
their oom_score_adj usage patterns.

> > Currently we just lie to users about their oom_score_adj
> > in this weird corner case.
>
> Hmm, looks like a bug, but nobody has ever complained about it.

Yes and that leads me to a suspicion that we can do that. Maybe I should
just add a note into the log that we are doing that so that people can
complain? Something like the following
diff --git a/fs/proc/base.c b/fs/proc/base.c
index fa0b3ca94dfb..7f3495415719 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1104,7 +1104,6 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
err_sighand:
unlock_task_sighand(task, &flags);
err_put_task:
- put_task_struct(task);

if (mm) {
struct task_struct *p;
@@ -1113,6 +1112,10 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
for_each_process(p) {
task_lock(p);
if (!p->vfork_done && process_shares_mm(p, mm)) {
+ pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
+ task_pid_nr(p), p->comm,
+ p->signal->oom_score_adj, oom_adj,
+ task_pid_nr(task), task->comm);
p->signal->oom_score_adj = oom_adj;
if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
p->signal->oom_score_adj_min = (short)oom_adj;
@@ -1122,6 +1125,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
rcu_read_unlock();
mmdrop(mm);
}
+ put_task_struct(task);
out:
mutex_unlock(&oom_adj_mutex);
return err;

> > The only exception was OOM_SCORE_ADJ_MIN
> > where we really didn't kill the task but all other values are simply
> > ignored in practice.
> >
> > > May be, we'd better add mm->oom_score_adj and set it to the min
> > > signal->oom_score_adj over all processes sharing it? This would
> > > require iterating over all processes every time oom_score_adj gets
> > > updated, but that's a slow path.
> >
> > Not sure I understand. So you would prefer that mm->oom_score_adj might
> > disagree with p->signal->oom_score_adj?
>
> No, I wouldn't. I'd rather agree that oom_score_adj should be per mm,
> because we choose the victim basing solely on mm stats.
>
> What I mean is we don't touch p->signal->oom_score_adj of other tasks
> sharing mm, but instead store minimal oom_score_adj over all tasks
> sharing mm in the mm_struct whenever a task's oom_score_adj is modified.
> And use mm->oom_score_adj instead of signal->oom_score_adj in oom killer
> code. This would save us from any accusations of user API modifications
> and it would also make the oom code a bit easier to follow IMHO.

I understand your point but this is essentially lying because we
consider a different value than the user can observe in userspace.
Consider somebody doing insanity like

current->oom_score_adj = OOM_SCORE_ADJ_MIN
p = clone(CLONE_VM)
p->oom_score_adj = OOM_SCORE_ADJ_MAX

so one process would want to be always selected while the other one
doesn't want to get killed. All they can see is that everything is
put in place until the oom killer comes over and ignores that.

I think we should just be explicit. Maybe we want to treat
OOM_SCORE_ADJ_MIN special - e.g. do not even try to set oom_score_adj if
one of the sharing tasks is oom disabled. But I would rather wait for
somebody to complain and explain why the usecase really makes sense than
be all silent with implicit behavior.

Btw. we have already had per mm oom_core_adj but we had to revert it due
to vfork behavior. See 0753ba01e126 ("mm: revert "oom: move oom_adj
value""). This patch gets us back except it handles the vfork issue.
--
Michal Hocko
SUSE Labs