Re: [RFC 2/2] OOM, PM: make OOM detection in the freezer path raceless

From: Michal Hocko
Date: Thu Dec 04 2014 - 11:56:14 EST


On Thu 04-12-14 09:44:54, Tejun Heo wrote:
> On Thu, Dec 04, 2014 at 03:16:23PM +0100, Michal Hocko wrote:
> > > A delta but shouldn't it be pr_cont()?
> >
> > kernel/power/process.c doesn't use pr_* so I've stayed with what the
> > rest of the file is using. I can add a patch which transforms all of
> > them.
>
> The console output becomes broken when a plain printk() is used for
> continuation lines. So, yeah, it'd be great to fix it.
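For reference, a sketch of what the conversion would look like (the
strings are only placeholders for what kernel/power/process.c actually
prints):

	pr_info("Freezing user space processes ... ");
	error = try_to_freeze_tasks(true);
	if (!error)
		pr_cont("done.");
	pr_cont("\n");

i.e. only the first chunk of the line goes through pr_info(), every
continuation through pr_cont().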
>
> > > > +extern bool oom_killer_disabled;
> > >
> > > Ugh... don't we wanna put this in a header file?
> >
> > Who else would need the declaration? This is not something random code
> > should look at.
>
> Let's say, somebody changes the type to ulong for whatever reason
> later and forgets to update this declaration. What happens then on a
> big endian machine?

OK, I see your point, although this is unlikely...
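Will move the declaration to a header so that the compiler can check it
against the definition. A sketch (assuming include/linux/oom.h is the
right place for it):

	/* include/linux/oom.h */
	extern bool oom_killer_disabled;

	/* mm/oom_kill.c - the definition, now checked against the header */
	bool oom_killer_disabled __read_mostly;

kernel/power/process.c then simply includes <linux/oom.h> instead of
carrying its own extern.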

> Jesus, this is basic C programming. You don't sprinkle external
> declarations which the compiler can't verify against the actual
> definitions. There's absolutely no compelling reason to do that here.
> Why would you take out compiler verification for no reason?
>
> > > > +void mark_tsk_oom_victim(struct task_struct *tsk)
> > > >  {
> > > > -	return atomic_read(&oom_kills);
> > > > +	BUG_ON(oom_killer_disabled);
> > >
> > > WARN_ON_ONCE() is prolly a better option here?
> >
> > Well, something fishy is going on when oom_killer_disabled is set and
> > we mark a new OOM victim. That is a clear bug. Why would we only warn
> > and allow the follow-up breakage?
>
> Because the system is more likely to be able to go on, and as a
> general rule we don't BUG when we can WARN. A working system is almost
> always better than a dead one, even for debugging.
>
> > > > + if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > >
> > > Can a task actually be selected as an OOM victim multiple times?
> >
> > AFAICS nothing prevents the global OOM and memcg OOM killers from
> > racing.
>
> Maybe it'd be a good idea to note that in the comment?

ok
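So something like this in the next version (a sketch; assuming the
oom_victims accounting otherwise stays as in the patch):

	void mark_tsk_oom_victim(struct task_struct *tsk)
	{
		WARN_ON_ONCE(oom_killer_disabled);
		/* the global OOM killer might race with the memcg one */
		if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
			return;
		atomic_inc(&oom_victims);
	}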

> > > > -void note_oom_kill(void)
> > > > +void unmark_tsk_oom_victim(struct task_struct *tsk)
> > > >  {
> > > > -	atomic_inc(&oom_kills);
> > > > +	int count;
> > > > +
> > > > +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > > +		return;
> > >
> > > Maybe test this inline in exit_mm()? e.g.
> > >
> > > 	if (test_thread_flag(TIF_MEMDIE))
> > > 		unmark_tsk_oom_victim(current);
> >
> > Why do you think testing TIF_MEMDIE in exit_mm is better? I would like
> > to reduce the usage of the flag as much as possible.
>
> Because it's adding a function call/return to a hot path for everybody.
> It sure is a minuscule cost, but we'd be adding it for no good reason.

ok.
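I.e. a sketch:

	/* exit_mm() */
	if (test_thread_flag(TIF_MEMDIE))
		unmark_tsk_oom_victim(current);

and unmark_tsk_oom_victim() can then drop its test_and_clear and clear
the flag unconditionally.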

> > > So, each complete() increments the done count and each wait
> > > decrements it. The above code works iff the complete()s and wait()s
> > > are always balanced, which usually isn't true in this type of wait
> > > code. Either use reinit_completion() / complete_all() combos or
> > > wait_event().
> >
> > Hmm, I thought that only a single instance of freeze_kernel_threads
> > (which calls oom_killer_disable) can run at a time, but I am currently
> > not sure that all paths are called under lock_system_sleep. I am not
> > familiar with the reinit_completion API. Is the following correct?
>
> Hmmm... wouldn't wait_event() be easier to read in this case?

OK, it does look easier. I thought it would require some additional
synchronization between the wake-up and the wait, but everything
necessary is already done inside wait_event(), so we cannot miss a
wake-up AFAICS:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1d55ab12792f..032be9d2a239 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -408,7 +408,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
  * Number of OOM victims in flight
  */
 static atomic_t oom_victims = ATOMIC_INIT(0);
-static DECLARE_COMPLETION(oom_victims_wait);
+static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 static DECLARE_RWSEM(oom_sem);
@@ -435,7 +435,7 @@ void unmark_tsk_oom_victim(void)
 	 * is nobody who cares.
 	 */
 	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
-		complete_all(&oom_victims_wait);
+		wake_up_all(&oom_victims_wait);
 	up_read(&oom_sem);
 }

@@ -464,16 +464,10 @@ bool oom_killer_disable(void)
 		return false;
 	}
 
-	/* unmark_tsk_oom_victim is calling complete_all */
-	if (!oom_killer_disabled)
-		reinit_completion(&oom_victims_wait);
-
 	oom_killer_disabled = true;
-	count = atomic_read(&oom_victims);
 	up_write(&oom_sem);
 
-	if (count)
-		wait_for_completion(&oom_victims_wait);
+	wait_event(oom_victims_wait, !atomic_read(&oom_victims));
 
 	return true;
 }
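To spell out why we cannot miss the wake-up: wait_event() roughly
expands to the loop below, and prepare_to_wait() puts us on the queue
and sets the task state _before_ the condition is re-checked, so a
wake_up_all() from unmark_tsk_oom_victim() either finds us on the queue
or we see oom_victims already at zero:

	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&oom_victims_wait, &wait,
				TASK_UNINTERRUPTIBLE);
		if (!atomic_read(&oom_victims))
			break;
		schedule();
	}
	finish_wait(&oom_victims_wait, &wait);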

> ...
> > > Maybe 0 / -errno is a better choice as return values?
> >
> > I do not have a problem changing this if you feel strongly about it,
> > but true/false sounds easier to me and it allows the caller to decide
> > what to report. If there were multiple reasons to fail then sure, but
> > that is not the case.
>
> It's not a big deal, but except for functions which have clear boolean
> behavior - functions which try/attempt something, or query or decide

This is basically a try_lock which might fail for whatever internal
reason.

> certain things - randomly thrown-in bool returns tend to become
> confusing, especially because the bool failure value is the opposite
> of the 0/-errno failure value. So, "this function can only fail for
> one reason" is usually a bad and arbitrary reason for choosing a bool
> return: it causes confusion at call sites and headaches when the
> function develops more reasons to fail.
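For comparison, the 0/-errno variant would look something like this
(a sketch; assuming the only failure mode stays "the current task is
already an OOM victim"):

	int oom_killer_disable(void)
	{
		down_write(&oom_sem);
		if (test_thread_flag(TIF_MEMDIE)) {
			/* cannot disable while being an OOM victim ourselves */
			up_write(&oom_sem);
			return -EBUSY;
		}

		oom_killer_disabled = true;
		up_write(&oom_sem);

		wait_event(oom_victims_wait, !atomic_read(&oom_victims));

		return 0;
	}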
>
> ...
> > > > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
> > > >  {
> > > >  	struct zonelist *zonelist;
> > > > 
> > > > +	down_read(&oom_sem);
> > > >  	if (mem_cgroup_oom_synchronize(true))
> > > > -		return;
> > > > +		goto unlock;
> > > > 
> > > >  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
> > > >  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> > > > -		out_of_memory(NULL, 0, 0, NULL, false);
> > > > +		if (!oom_killer_disabled)
> > > > +			__out_of_memory(NULL, 0, 0, NULL, false);
> > > >  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
> > >
> > > Is this a condition which can happen and which we can deal with?
> > > With userland fully frozen, there shouldn't be any page faults which
> > > lead to memory allocation, right?
> >
> > Except for racing OOM victims which were missed by try_to_freeze_tasks
> > because they didn't get a CPU slice to wake up from the freezer. Such
> > a task would die on its way out of the page fault exception. I have
> > updated the changelog to be more verbose about this.
>
> That's not at all obvious. Let's please add a comment explaining
> that.

@@ -778,6 +795,15 @@ void pagefault_out_of_memory(void)
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
 		if (!oom_killer_disabled)
 			__out_of_memory(NULL, 0, 0, NULL, false);
+		else
+			/*
+			 * There shouldn't be any user tasks runnable while
+			 * the OOM killer is disabled, so the current task
+			 * has to be a racing OOM victim for which
+			 * oom_killer_disable() is waiting.
+			 */
+			WARN_ON(!test_thread_flag(TIF_MEMDIE));
+
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
 unlock:

>
> > > (it only makes sense while the whole system is in a quiescent state)
> > > and at least trigger WARN_ON_ONCE() if the above code path gets
> > > triggered while the OOM killer is disabled?
> >
> > I can add a WARN_ON(!test_thread_flag(TIF_MEMDIE)).
>
> Yeah, that makes sense to me.
>
> Thanks.
>
> --
> tejun

--
Michal Hocko
SUSE Labs