Re: [PATCH v1 2/3] mm: process_mrelease: skip LRU movement for exclusive file folios

From: Minchan Kim

Date: Tue Apr 28 2026 - 21:19:37 EST

On Tue, Apr 28, 2026 at 08:56:36AM +0200, Michal Hocko wrote:
> On Mon 27-04-26 16:05:04, Minchan Kim wrote:
> > On Mon, Apr 27, 2026 at 07:15:39PM +0200, Michal Hocko wrote:
> > > On Mon 27-04-26 09:48:28, Suren Baghdasaryan wrote:
> > > > On Mon, Apr 27, 2026 at 12:16 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > > >
> > > > > On Fri 24-04-26 12:15:18, Minchan Kim wrote:
> > > > > > On Fri, Apr 24, 2026 at 09:57:16AM +0200, David Hildenbrand (Arm) wrote:
> > > > > > > On 4/24/26 09:51, Michal Hocko wrote:
> > > > > > > > On Tue 21-04-26 16:02:38, Minchan Kim wrote:
> > > > > > > > >> For process_mrelease reclaim, skip LRU handling for exclusive
> > > > > > > > >> file-backed folios: they will be freed soon, so it is pointless
> > > > > > > > >> to move them around in the LRU.
> > > > > > > >>
> > > > > > > >> This avoids costly LRU movement which accounts for a significant portion
> > > > > > > >> of the time during unmap_page_range.
> > > > > > > >>
> > > > > > > >> - 91.31% 0.00% mmap_exit_test [kernel.kallsyms] [.] exit_mm
> > > > > > > >> exit_mm
> > > > > > > >> __mmput
> > > > > > > >> exit_mmap
> > > > > > > >> unmap_vmas
> > > > > > > >> - unmap_page_range
> > > > > > > >> - 55.75% folio_mark_accessed
> > > > > > > >> + 48.79% __folio_batch_add_and_move
> > > > > > > >> 4.23% workingset_activation
> > > > > > > >> + 12.94% folio_remove_rmap_ptes
> > > > > > > >> + 9.86% page_table_check_clear
> > > > > > > >> + 3.34% tlb_flush_mmu
> > > > > > > >> 1.06% __page_table_check_pte_clear
> > > > > > > >>
> > > > > > > >> Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > > > > > > >
> > > > > > > > > As pointed out in the previous version of the patch, I really dislike
> > > > > > > > > this being mrelease- or OOM-specific behavior. You do not explain why
> > > > > > > > > this needs to be this way, except for performance reasons. My main
> > > > > > > > > question is still unanswered (and NAK before this is sorted out): why
> > > > > > > > > can this not be applied in general for _any_ exiting task? As you argue,
> > > > > > > > > the memory will likely just go away, so why bother?
> > > > > > >
> > > > > > > I think there was a lengthy discussion involving Johannes from a previous series.
> > > > > > >
> > > > > > > That should be linked here indeed.
> > > > > >
> > > > > > How about this?
> > > > > >
> > > > > > mm: process_mrelease: skip LRU movement for exclusive file folios
> > > > > >
> > > > > > During process_mrelease() or OOM reaping, unmapping file-backed folios
> > > > > > spends a significant portion of CPU time in folio_mark_accessed() to
> > > > > > maintain accurate LRU state (~55% of unmap time as shown in the profile
> > > > > > below).
> > > > > >
> > > > > > This patch skips LRU handling for exclusive file-backed folios during
> > > > > > such emergency memory reclaim.
> > > > > >
> > > > > > One might ask why this optimization shouldn't be applied to any exiting
> > > > > > task in general. The reason is that for a normal, orderly exit or just
> > > > > > pure kill, it is worth paying the CPU cost to preserve the active state
> > > > > > of clean file folios in case they are reused soon. Preserving cache hits
> > > > > > is beneficial for overall system performance.
> > > > >
> > > > > This is a statement rather than an explanation. Why is it worth paying
> > > > > the cost? What is different here?
> > > > >
> > > > > > However, process_mrelease() and OOM reaping are emergency operations
> > > > > > triggered under extreme memory pressure. In these scenarios, the highest
> > > > > > priority is to recover memory as quickly as possible to avoid further
> > > > > > kills or system jank. Spending half of the unmap time on LRU maintenance
> > > > > > for pages belonging to a victim process is a bad trade-off. If speeding up
> > > > > > the victim's reclaim by avoiding LRU movement and evicting cache negatively
> > > > > > affects the workflow (due to immediate restart), it implies a sub-optimal
> > > > > > kill target selection by the userspace policy (e.g., LMKD), rather than
> > > > > > > a problem in these expedited APIs.
> > > > >
> > > > > > Your change effectively boils down to breaking aging for exclusively mapped
> > > > > file pages when those pages should have been activated. All that because
> > > > > the activation has some (batched) overhead. You argue that the overhead
> > > > > is not a good trade-off for OOM path because those pages are exclusive
> > > > > to the process and therefore they will go away after the task exits.
> > > >
> > > > I think Minchan's argument is that mm reaping occurs only in special
> > > > conditions (under high memory pressure) and for a very specific reason
> > > > (to free up memory and prevent system memory starvation). Therefore
> > > > priority in such conditions should shift towards more aggressive
> > > > memory reclaim instead of normal aging. I can see both his point and a
> > > > counter-argument that this might cause more refaults in some cases.
> > >
> > > > The way I see this is that standard memory reclaim under heavy
> > > > memory pressure would likely have encountered those pages and aged
> > > > them accordingly already. So this is effectively racing with that
> > > > process and potentially makes the opposite decision.
> > > I suspect that a lack of memory reclaim, as implied by the other patch
> > > (to deal with clean page cache), is the reason why this one makes a
> > > difference in these Android deployments.
> >
> > The claim that kswapd would have already aged these pages is just an
> > assumption; it is ultimately a matter of timing. We cannot reliably
> > predict whether kswapd has processed them, nor can we know the future
> > access patterns of a dying process.
> >
> > Global system policies are not always optimal for every specific use case.
> > That is precisely why we have hinting APIs like madvise and fadvise.
> >
> > While hinting APIs can indeed conflict with global policies, a negative
> > performance impact would imply that userspace is misusing the API, not
> > that the optimization itself shouldn't exist.
> >
> > We should view process_mrelease() (and this flag) as a similar hinting
> > mechanism where userspace explicitly requests expedited, aggressive reclaim
> > for a specific target under memory pressure.
>
> This is you bending the definition of what process_mrelease is, and I
> disagree. There is nothing about aggressiveness in process_mrelease.
> There are no aging assumptions. We do not have an official man page but
> this is from the initial comment introducing the syscall
> DESCRIPTION
> The process_mrelease() system call is used to free the memory of
> an exiting process.

"Free the memory of an exiting process" implies all memory, not just
anonymous memory. A user cannot know that it frees only anonymous memory,
and I am trying to make it work as intended by completing a symmetric
reclamation path.

>
> The pidfd selects the process referred to by the PID file
> descriptor.
> (See pidfd_open(2) for further information)
>
> The flags argument is reserved for future use; currently, this
> argument must be specified as 0.
>
> Userspace oom killers are one obvious user of the interface.
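
For reference, the way a userspace killer drives the interface described
above can be sketched roughly as follows. This is only an illustration and
not part of the patch: kill_and_reap() is a hypothetical helper name, and the
fallback syscall numbers assume the generic syscall table (e.g. x86_64,
arm64) for builds against older headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed fallbacks for old headers; not valid on every architecture. */
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_mrelease
#define SYS_process_mrelease 448
#endif

/*
 * Hypothetical helper: SIGKILL @pid through a pidfd, then ask the kernel
 * to release the memory of the dying task. Returns 0 on success, or -1
 * with errno set by the failing syscall.
 */
static int kill_and_reap(pid_t pid)
{
	int saved_errno;
	long ret;
	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -1;

	/* The target has to be exiting before its memory can be released. */
	ret = syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0);
	if (ret == 0) {
		/* flags is reserved and must currently be 0. */
		ret = syscall(SYS_process_mrelease, pidfd, 0);
	}

	/* Preserve the errno of the failing call across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;

	return ret == 0 ? 0 : -1;
}
```

Nothing in this call sequence carries an aging or aggressiveness hint today;
flags is the only place such an opt-in could be expressed.
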
>
> > > Unless I am completely wrong and misreading the whole situation, this
> > > might be a very Android-specific change. The question is whether these
> > > side effects are generally useful for other workloads. So we really need
> > > much more explanation of the actual behavior after this change for a
> > > wider variety of workloads.
> >
> > While the primary motivation comes from Android's LMKD, this optimization
> > is not active for normal workloads. It only applies to tasks that are
> > already being reaped by the OOM reaper or by process_mrelease() with the
> > special flag (via MMF_UNSTABLE).
> >
> > Therefore, it is an opt-in or emergency-only behavior that will not hurt
> > a wider variety of general workloads unless they explicitly use this
> > targeted reclaim API. Any system with a userspace killer needing fast,
> > targeted reclaim can benefit from this.
>
> But any user of this interface will see side effects of your
> implementation.
>
> Look, you haven't convinced me that you are fully aware of all the
> consequences. Your arguments are weak and you seem to be uninterested
> in use cases beyond your specific Android LMK implementation.
>
> So I am not in support of this change, same as with the page cache one.
> Again, I am NOT NAKing this patch, but I do insist that a) the patch
> description is damn clear about side effects and b) there is support
> from other non-Android people using this syscall.

I am trying to isolate the new behavior behind the new flag on
process_mrelease. What about this? (It needs more changes from the previous
feedback, but it should be enough to show the intention.)