Re: [PATCH v3 1/2] mm: introduce process_mrelease system call

From: Michal Hocko
Date: Mon Jul 26 2021 - 03:27:23 EST


On Fri 23-07-21 10:00:26, Shakeel Butt wrote:
> On Fri, Jul 23, 2021 at 9:09 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > On Fri, Jul 23, 2021 at 6:46 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Jul 23, 2021 at 1:53 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > >
> > > [...]
> > > > > However
> > > > > retrying means issuing another syscall, so additional overhead...
> > > > > I guess such "best effort" approach would be unusual for a syscall, so
> > > > > maybe we can keep it as it is now and if such "do not block" mode is needed
> > > > > we can use flags to implement it later?
> > > >
> > > > Yeah, an explicit opt-in via flags would be an option if that turns out
> > > > to be really necessary.
> > > >
> > >
> > > I am fine with keeping it as it is but we do need the non-blocking
> > > option (via flags) to enable userspace to act more aggressively.
> >
> > I think you want to check memory conditions shortly after issuing
> > kill/reap requests irrespective of mmap_sem contention. The reason is
> > that even when memory release is not blocked, allocations from other
> > processes might consume memory faster than we release it. For example,
> > in Android we issue kill and start waiting on pidfd for its death
> > notification. As soon as the process is dead we reassess the situation
> > and possibly kill again. If the process is not dead within a
> > configurable timeout we check conditions again and might issue more
> > kill requests (IOW our wait for the process to die has a timeout). If
> > process_mrelease() is blocked on mmap_sem, we might timeout like this.
> > I imagine that a non-blocking option for process_mrelease() would not
> > really change this logic.
>
> On a containerized system, killing a job requires killing multiple
> processes and then process_mrelease() them. Now there is cgroup.kill
> to kill all the processes in a cgroup tree but we would still need to
> process_mrelease() all the processes in that tree.

Is process_mrelease on all of them really necessary? I thought that the
primary reason for the call is to guarantee forward progress in cases
where the userspace OOM victim cannot die on SIGKILL. That should be
more of an exception than the normal case, no?
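
For illustration only, a minimal userspace sketch of that fallback usage:
kill the victim, wait on its pidfd with a timeout, and call
process_mrelease() only if the victim has not died in time. It assumes the
(pidfd, flags) signature with flags == 0 and the syscall number from the
version that was eventually merged; kill_and_reap() and timeout_ms are
hypothetical names, not part of the patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback syscall numbers (generic/x86_64) for older libc headers. */
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_mrelease
#define SYS_process_mrelease 448 /* assumption: number used by the merged syscall */
#endif

/*
 * Hypothetical helper, not part of the patch.
 * Returns  0 if the victim exited within timeout_ms,
 *          1 if it did not and process_mrelease() succeeded,
 *         -1 on error (errno is preserved for the caller).
 */
int kill_and_reap(pid_t pid, int timeout_ms)
{
	struct pollfd pfd;
	int pidfd, n, saved_errno, ret = -1;

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0)
		goto out;

	/* The pidfd becomes readable once the process has exited. */
	pfd.fd = pidfd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	do {
		n = poll(&pfd, 1, timeout_ms);
	} while (n < 0 && errno == EINTR);

	if (n < 0)
		goto out;
	if (n > 0) {
		ret = 0;	/* normal case: the victim died on SIGKILL */
		goto out;
	}

	/* Exceptional case: still alive after the timeout, reap its memory. */
	if (syscall(SYS_process_mrelease, pidfd, 0) == 0)
		ret = 1;
out:
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

With this shape the extra syscall is paid only for victims that fail to
exit in time. A caller killing many processes would first signal all of
them and then run the wait/reap step per process.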

> There is a chance
> that we get stuck in reaping the early process. Making
> process_mrelease() non-blocking will enable the userspace to go to
> other processes in the list.

I do agree that allowing (guaranteed) non-blocking behavior is nice but
it is also a rather strong promise. There is some memory that cannot be
released by the oom reaper currently because there are locks involved
(e.g. mlocked memory or memory areas backed by blocking notifiers).
I can imagine some users of this API would rather block and make sure to
release the memory rather than skip over it. So if anything this has to
be an opt-in with a big fat warning that the behavior of the kernel wrt
releasable memory can vary due to all sorts of implementation
details.

> An alternative would be to have a cgroup specific interface for
> reaping similar to cgroup.kill.

Could you elaborate?

--
Michal Hocko
SUSE Labs