Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag

From: Minchan Kim

Date: Fri May 15 2026 - 18:35:50 EST


On Fri, May 15, 2026 at 10:15:53PM +0200, Oleg Nesterov wrote:
> In fact I don't even understand the motivation...
>
> On 05/15, Christian Brauner wrote:
> >
> > On Mon, May 11, 2026 at 02:42:26PM -0700, Minchan Kim wrote:
> > > leaving the actual address space teardown (exit_mmap) to be deferred until
> > > the mm's reference count drops to zero. In the field (e.g., Android),
> > > arbitrary reference counts (reading /proc/<pid>/cmdline, or various other
> > > remote VM accesses) frequently delay this teardown indefinitely,
>
> Sure, get_task_cmdline() can delay mmput(). But indefinitely ?
>
> Perhaps the changelog could be more clear? I don't see how any remote VM access
> can pin mm->mm_users "indefinitely". Even if, say, a lot of threads read
> /proc/<pid>/cmdline in an endless loop in parallel...
>
> I must have missed something.

Thank you for the review and questions. You are right that under normal,
uncongested conditions a /proc reader calls mmput() and drops its mm_users
reference quickly.

However, on a heavily loaded system under severe memory/CPU pressure, this delay
can be long enough to cause cascading issues. Here is how this occurs and why,
from an emergency reclaim perspective, it is effectively an indefinite delay.

When memory pressure is critical, a userspace OOM killer terminates a large
victim process. Simultaneously, another process (such as a monitoring tool) is
reading /proc/<pid>/smaps or /proc/<pid>/cmdline. Because the system is heavily
loaded, the reader thread on CPU C can get preempted or blocked while still
holding the mm_users reference it took via get_task_mm().

When the dying victim executes exit_mm(), mm_users drops from 2 to 1 rather
than to 0, so exit_mmap() does not run. For hundreds of milliseconds, or even
seconds, the victim's memory stays pinned. The userspace OOM policy sees that
memory is still critically low and unnecessarily kills additional innocent
processes.
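
To make the refcounting concrete, here is a minimal userspace toy model of the
above. It is not kernel code: struct toy_mm and the toy_* helpers are
hypothetical names that only mimic mm_users, mmget()/mmput() and the point at
which exit_mmap() would run.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: a hypothetical stand-in for the kernel's mm_struct. */
struct toy_mm {
	int mm_users;    /* number of users pinning the address space */
	bool torn_down;  /* set once the model's "exit_mmap()" has run */
};

/* Models mmget()/get_task_mm(): take a reference on the address space. */
static void toy_mmget(struct toy_mm *mm)
{
	mm->mm_users++;
}

/* Models mmput(): only the final put tears the address space down. */
static void toy_mmput(struct toy_mm *mm)
{
	assert(mm->mm_users > 0);
	if (--mm->mm_users == 0)
		mm->torn_down = true;  /* stands in for exit_mmap() */
}

/*
 * Replays the scenario and reports whether the victim's exit alone
 * was enough to free the memory.
 */
static bool toy_freed_at_exit(bool reader_holds_ref)
{
	struct toy_mm mm = { .mm_users = 1, .torn_down = false };

	if (reader_holds_ref)
		toy_mmget(&mm);  /* CPU C: /proc reader pins the mm, then stalls */

	toy_mmput(&mm);          /* CPU B: victim's exit_mm() drops its reference */
	return mm.torn_down;
}
```

In this model, toy_freed_at_exit(false) reports that teardown ran at exit, while
toy_freed_at_exit(true) reports that it did not: the memory stays pinned until
the stalled reader issues its own put.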

Here is the exact timing chart illustrating the existing problem and why
process_mrelease() fails in this scenario:

CPU A (Userspace OOM Killer)    CPU B (Victim Task)             CPU C (/proc Reader)
----------------------------    -------------------             --------------------
                                                                open(/proc/pid/smaps)
                                                                get_task_mm()
                                                                [mm_users++ => 2]
                                                                (Preempted/Stalled)
                                                                        |
1. Sends SIGKILL                                                        |
                                2. Victim receives SIGKILL              |
                                   do_exit()                            |
                                     exit_mm()                          |
                                       task->mm = NULL                  |
                                       mmput() [mm_users => 1]          |
                                       (Memory NOT freed!)              |
                                                                        |
3. Calls process_mrelease()                                             |
                                                                        |
   find_lock_task_mm() sees task->mm == NULL                            |
   Returns -ESRCH. Reaping fails!                                       |
   (Memory remains trapped until CPU C finally finishes!) <=============/
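
For reference, the CPU A side of the chart can be sketched from userspace
roughly as follows. This is only an illustration of the existing
kill-then-reap sequence with flags == 0; kill_and_reap() is a hypothetical
helper name, and the fallback syscall numbers are the generic ones and are
only used if the installed headers lack them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older headers (assumption: generic syscall numbers). */
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448
#endif

/*
 * Hypothetical helper: SIGKILL the victim, then ask the kernel to reap
 * its memory. Returns 0 on success or a negative errno value.
 */
static int kill_and_reap(pid_t pid)
{
	int pidfd, ret = 0;

	if (pid <= 0)
		return -EINVAL;

	pidfd = (int)syscall(__NR_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -errno;

	/* Step 1 in the chart: send SIGKILL. */
	if (syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0)
		ret = -errno;
	/*
	 * Step 3 in the chart: try to reap. If the victim has already
	 * cleared task->mm in exit_mm(), this fails with ESRCH even
	 * though a stalled reader may still pin the memory.
	 */
	else if (syscall(__NR_process_mrelease, pidfd, 0) < 0)
		ret = -errno;

	close(pidfd);
	return ret;
}
```

A caller that gets -ESRCH back cannot tell whether the memory was actually
released or is still pinned by another mm_users holder, which is the gap
described above.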

I hope this clarifies the motivation and mechanics behind this issue.