Re: [PATCH v5] mm/oom_kill.c: futex: Close a race between do_exit and the oom_reaper

From: Michal Hocko
Date: Wed Mar 23 2022 - 05:17:34 EST


Let me skip over the futex part, which I still need to digest, and only
focus on the oom side of things for clarification.

On Tue 22-03-22 23:43:18, Thomas Gleixner wrote:
[...]
> > While some places can be handled by changing uninterruptible waiting to
> > killable there are places which are not really fixable, e.g. lock
> > chain dependency which leads to memory allocation.
>
> I'm not following. Which lock chain dependency causes memory allocation?

Consider an oom victim that is blocked on a lock or waiting for an event
to happen while the lock holder is stuck allocating, or the wake up
depends on an allocation. Many code paths holding sleeping locks do
GFP_KERNEL allocations inside the critical section.
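
Just to illustrate (the lock and function names here are completely made
up, this is not any specific in-tree path):

        #include <linux/mutex.h>
        #include <linux/slab.h>

        static DEFINE_MUTEX(shared_lock);

        void lock_holder(void)
        {
                void *buf;

                mutex_lock(&shared_lock);
                /* under oom pressure this allocation can block for a long time ... */
                buf = kmalloc(4096, GFP_KERNEL);
                do_fs_work(buf);        /* stand-in for arbitrary FS work */
                kfree(buf);
                mutex_unlock(&shared_lock);
        }

        void oom_victim(void)
        {
                /*
                 * ... so the victim never gets the lock and never reaches
                 * its exit path, while it is pinning the very memory the
                 * allocation above is waiting for.
                 */
                mutex_lock(&shared_lock);
                mutex_unlock(&shared_lock);
        }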

> Also aren't oom killed tasks allowed to allocate from emergency buffers
> in order to clean themself up? That's at least what the comments in the
> oom code claim.

Yes, this is the case, and it is the slightly easier scenario. A dip
into memory reserves can indeed help to make forward progress. But as
mentioned above, the victim can be stuck doing something else entirely.

> Aside of that. Do you have call chains which show in which situation
> such a stuck task is?

I do remember Tetsuo was really good at triggering those issues. It was
no rocket science: essentially a memory stress test hammering the FS
code, which usually triggered a dependency of the oom victim on a lock
or wait queue whose holder/waker was stuck allocating.

I have managed to push the details about the specific locks or traces
out of my brain, but just from a high level look, checking how many
GFP_KERNEL allocations are done while holding a mutex or semaphore
shared among processes, the dependency chain is simply unavoidable.

[...]

> You can easily validate that by doing:
>
> wake_oom_reaper(task)
>     task->reap_time = jiffies + HZ;
>     queue_task(task);
>     wakeup(reaper);
>
> and then:
>
> oom_reap_task(task)
>     now = READ_ONCE(jiffies);
>     if (time_before(now, task->reap_time))
>         schedule_timeout_idle(task->reap_time - now);
>
> before trying to actually reap the mm.
>
> That will prevent the enforced race in most cases and allow the exiting
> and/or killed processes to cleanup themself. Not pretty, but it should
> reduce the chance of the reaper to win the race with the exiting and/or
> killed process significantly.
>
> It's not going to work when the problem is combined with a heavy VM
> overload situation which keeps a guest (or one/some it's vCPUs) away
> from being scheduled. See below for a discussion of guarantees.
>
> If it failed to do so when the sleep returns, then you still can reap
> it.

Yes, this is certainly an option. Please note that the oom_reaper is not
the only way to trigger this. The process_mrelease syscall performs the
same operation from userspace. Arguably process_mrelease can be used
sanely/correctly because a userspace oom killer can do preparatory
cleanup steps before going for the final SIGKILL & process_mrelease. One
way would be to send SIGTERM in the first step and allow the victim to
perform its cleanup.
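
Just as an illustration of what I mean (the grace period is arbitrary,
error handling is omitted and SYS_process_mrelease requires sufficiently
new kernel headers):

        #include <signal.h>
        #include <stddef.h>
        #include <unistd.h>
        #include <sys/types.h>
        #include <sys/syscall.h>

        static void kill_and_reap(pid_t pid)
        {
                int pidfd = syscall(SYS_pidfd_open, pid, 0);

                /* step 1: ask politely and give the victim time to clean up */
                syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0);
                sleep(1);

                /* step 2: final SIGKILL and reap the address space */
                syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
                syscall(SYS_process_mrelease, pidfd, 0);
                close(pidfd);
        }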

> > It is fundamentally wrong to reap the memory which the exit path
> > really need to do a proper clean up.
>
> Depends on the guarantees you make. If you say preventing OOM starvation
> at the end is more important than a stale futex, then it's fine as long
> as it is properly documented.

Yes, this is the call we have already made for process core dumps: they
are simply incomplete for oom victims. I thought that handling robust
futex lists would require stronger guarantees. More on that later.

> That said, the robust list is no guarantee. It's a best effort approach
> which works well most of the time at least for the "normal" issues where
> a task holding a futex dies unexpectedly. But there is no guarantee that
> it works under all circumstances, e.g. OOM.

OK, so this is an important note. I am perfectly fine with documenting
this restriction. It is not like oom victims couldn't cause other
disruptions by leaving inconsistent/stale state behind.

> Sorry Nico, but your understanding
>
> >> On Mon, Mar 21 2022 at 21:09, Nico Pache wrote:
> >> From my understanding, the whole point of the robust futex is to allow forward
> >> progress in an application in which the lock holder CAN
> >> crash/exit/oom.
>
> is based on wishful thinking. There is absolutely no guarantee made by
> robust futexes. Why?
>
> Because the kernel can neither make a guarantee vs. user space
> controlled content nor vs. dealing with user space stupidity, e.g. a
> runaway memory hog.
>
> This is _NOT_ what robust list is about. Robust list was introduced to
> handle the unfortunate case where a task holding a futex dies
> unexpectedly.
>
> Extending this to OOM and expecting it to work under all circumstances
> is really wishful thinking.
>
> >> So semantically nothing is wrong with killing the futex holder... the
> >> whole point of the robustness is to handle these cases.
>
> Wrong. The whole point of robust lists is to handle the "normal" case
> gracefully. A process being OOM killed is _NOT_ in the "normal"
> category.
>
> Neither is it "normal" that a VM is scheduled out long enough to miss a
> 1 second deadline. That might be considered normal by cloud folks, but
> that's absolute not normal from an OS POV. Again, that's not a OS
> problem, that's an operator/admin problem.

Thanks for this clarification. I would tend to agree. Following the
earlier point that oom victims can leave inconsistent state behind which
can influence other processes, I am wondering what kind of expectations
about the lock protected state we can make when the holder of the lock
has been interrupted at a random place in the critical section.

[...]
> > And just to be clear, this is clearly a bug in the oom_reaper per se.
> > Originally I thought that relaxing the locking (using trylock and
> > retry/bail out on failure) would help but as I've learned earlier this
> > day this is not really possible because of #PF at least. The most self
> > contained solution would be to skip over vmas which are backing the
> > robust list which would allow the regular exit path to do the proper
> > cleanup.
>
> That's not sufficient because you have to guarantee that the relevant
> shared futex is accessible. See the lock chain example above.

Yeah, my previous understanding was that the whole linked list lives in
a single mapping and we could simply look at the addresses.
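
Just for reference, the uapi layout (include/uapi/linux/futex.h) only
anchors the per-thread list head. Each node is embedded in the userspace
mutex itself (e.g. inside glibc's pthread_mutex_t), so for process-shared
mutexes the nodes, and the futex words reached via futex_offset, can sit
in any shared mapping, not only in the VMA holding the head:

        struct robust_list {
                struct robust_list __user *next;
        };

        struct robust_list_head {
                struct robust_list list;
                long futex_offset;
                struct robust_list __user *list_op_pending;
        };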

> OTOH, in combination with delaying the reaper skipping the VMAs, which
> contain a robust list head, might be good enough for the trivial
> cases where the shared futex is the one to which the robust list head
> points to.
>
> Emphasis on VMAs. You need to evaluate every tasks robust list pointer
> of the process not only the one of the task which gets queued for
> reaping.
>
> So let me summarize what I think needs to be done in priority order:
>
> #1 Delay the oom reaper so the normal case of a process being able to
> exit is not subject to a pointless race to death.
>
> #2 If #1 does not result in the desired outcome, reap the mm (which is
> what we have now).
>
> #3 If it's expected that #2 will allow the stuck process to make
> progress on the way towards cleanup, then do not reap any VMA
> containing a robust list head of a thread which belongs to the
> exiting and/or killed process.
>
> The remaining cases, i.e. the lock chain example I pointed out above or
> the stuck forever task are going to be rare and fall under the
> collateral damage and no guarantee rule.

I do agree that delaying the oom_reaper wake up is the simplest thing to
do at this stage and it should catch most of the failures. We still have
the process_mrelease syscall case, but I guess we can document that as a
caveat in the man page.

Thanks!

> Thanks,
>
> tglx
> ---
> P.S.: I so hoped that after my first encounter with the oom killer
> almost two decades ago I wouldn't have to deal with this horror
> again. Bah!

heh, a nice place full of horrors and hard-to-see dependencies

--
Michal Hocko
SUSE Labs