Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM

From: Tetsuo Handa
Date: Fri Jan 11 2019 - 07:41:04 EST


On 2019/01/11 20:33, Michal Hocko wrote:
> On Fri 11-01-19 19:25:22, Tetsuo Handa wrote:
>> On 2019/01/11 8:59, Tetsuo Handa wrote:
>>> Michal Hocko wrote:
>>>> On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
>>>>> On 2019/01/09 20:03, Michal Hocko wrote:
>>>>>> Tetsuo,
>>>>>> can you confirm that these two patches are fixing the issue you have
>>>>>> reported please?
>>>>>>
>>>>>
>>>>> My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
>>>>> report racy no-eligible OOM tasks" does.
>>>>
>>>> OK, so we are stuck again. Hooray!
>>>
>>> Andrew, will you pick up "[PATCH 3/2] memcg: Facilitate termination of memcg OOM victims." ?
>>> Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
>>> when task_will_free_mem() == true, memcg-do-not-report-racy-no-eligible-oom-tasks.patch
>>> does not close the race whereas my patch closes the race better.
>>>
>>
>> I confirmed that mm-oom-marks-all-killed-tasks-as-oom-victims.patch and
>> memcg-do-not-report-racy-no-eligible-oom-tasks.patch are completely failing
>> to fix the issue I am reporting. :-(
>
> OK, this is really interesting. This means that we are racing
> when marking all the tasks sharing the mm with the clone syscall.

Nothing interesting. This is NOT a race between clone() and the OOM killer. :-(
By the time the OOM killer is invoked, all clone() requests have already completed.

Did you notice that there is no

"Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n"

line between

[ 71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child

and

[ 71.309149][ T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB

? Then you will find the following sequence:

 (1) In the first round of the out_of_memory() call, [ T9694] failed to reach the
     for_each_process(p) loop inside __oom_kill_process(), because Ctrl-C had let
     the victim complete exit_mm() before find_lock_task_mm() was called, and
     find_lock_task_mm() therefore returned NULL.

 (2) In the second round, [ T9750] (which is fatal_signal_pending() == T &&
     tsk_is_oom_victim() == F) hit the task_will_free_mem(current) path, called
     mark_oom_victim() and woke up the OOM reaper.

 (3) Before the third round started, the OOM reaper set MMF_OOM_SKIP.

 (4) When the third round started, [ T9748] could not hit the
     task_will_free_mem(current) path because MMF_OOM_SKIP was already set, and
     oom_badness() ignores any mm that already has MMF_OOM_SKIP set. As a result,
     [ T9748] failed to find a candidate.

 (5) Step (4) repeats up to the number of threads (213 times for this run).
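For reference, here is a simplified sketch of the three checkpoints this sequence
goes through (paraphrased from my reading of mm/oom_kill.c of this era; locking,
refcounting and unrelated details are elided, so treat it as an illustration
rather than the exact code):

    static void __oom_kill_process(struct task_struct *victim)
    {
            struct task_struct *p;

            /*
             * Round 1: the victim already passed exit_mm() (thanks to Ctrl-C),
             * so find_lock_task_mm() returns NULL and we bail out before the
             * "Killed process ..." pr_err() and the for_each_process(p) loop.
             */
            p = find_lock_task_mm(victim);
            if (!p) {
                    put_task_struct(victim);
                    return;
            }
            /* ... */
    }

    bool out_of_memory(struct oom_control *oc)
    {
            /* ... */
            /*
             * Round 2: current has a pending SIGKILL and still has an mm, so
             * it is selected without going through victim selection; the OOM
             * reaper is woken up and later sets MMF_OOM_SKIP on that mm.
             */
            if (task_will_free_mem(current)) {
                    mark_oom_victim(current);
                    wake_oom_reaper(current);
                    return true;
            }
            /* ... */
    }

    unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
                              const nodemask_t *nodemask, unsigned long totalpages)
    {
            /* ... */
            /*
             * Round 3 and later: task_will_free_mem(current) fails because
             * MMF_OOM_SKIP is already set, and every thread sharing the reaped
             * mm is skipped here for the same reason, so no candidate is found
             * even though none of those threads has terminated yet.
             */
            if (adj == OOM_SCORE_ADJ_MIN ||
                test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
                in_vfork(p)) {
                    task_unlock(p);
                    return 0;
            }
            /* ... */
    }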

> Does fatal_signal_pending handle this better?
>

Of course. My patch handles it perfectly. Even if we raced with clone() requests,
why would we need to care about threads doing clone()? Such threads are not inside
try_charge(), and therefore they cannot contribute to this issue by calling
out_of_memory() from try_charge().
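To illustrate why: the memcg OOM killer is invoked from the charge path, so only a
thread that is failing a charge ends up calling out_of_memory(). A minimal sketch
(paraphrased from mm/memcontrol.c of this era; the real code takes oom_lock, has
retries and more struct oom_control fields, so this is only an illustration):

    static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg,
                                         gfp_t gfp_mask, int order)
    {
            struct oom_control oc = {
                    .memcg    = memcg,
                    .gfp_mask = gfp_mask,
                    .order    = order,
            };

            /*
             * Reached only from try_charge() after reclaim failed to make room
             * for the charge.  A thread that is busy executing clone() is not
             * here, so it is never the caller of out_of_memory() for this
             * memcg OOM.
             */
            return out_of_memory(&oc);
    }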