RE: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext)

From: wangzicheng

Date: Thu Mar 26 2026 - 03:43:38 EST




> -----Original Message-----
> From: owner-linux-mm@xxxxxxxxx <owner-linux-mm@xxxxxxxxx> On Behalf
> Of Shakeel Butt
> Sent: Thursday, March 26, 2026 5:07 AM
> To: lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>; Johannes Weiner
> <hannes@xxxxxxxxxxx>; David Hildenbrand <david@xxxxxxxxxx>; Michal
> Hocko <mhocko@xxxxxxxxxx>; Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>;
> Lorenzo Stoakes <ljs@xxxxxxxxxx>; Chen Ridong
> <chenridong@xxxxxxxxxxxxxxx>; Emil Tsalapatis <emil@xxxxxxxxxxxxxxx>;
> Alexei Starovoitov <ast@xxxxxxxxxx>; Axel Rasmussen
> <axelrasmussen@xxxxxxxxxx>; Yuanchu Xie <yuanchu@xxxxxxxxxx>; Wei
> Xu <weixugc@xxxxxxxxxx>; Kairui Song <ryncsn@xxxxxxxxx>; Matthew
> Wilcox <willy@xxxxxxxxxxxxx>; Nhat Pham <nphamcs@xxxxxxxxx>; Gregory
> Price <gourry@xxxxxxxxxx>; Barry Song <21cnbao@xxxxxxxxx>; David
> Stevens <stevensd@xxxxxxxxxx>; Vernon Yang <vernon2gm@xxxxxxxxx>;
> David Rientjes <rientjes@xxxxxxxxxx>; Kalesh Singh
> <kaleshsingh@xxxxxxxxxx>; wangzicheng <wangzicheng@xxxxxxxxx>; T . J .
> Mercier <tjmercier@xxxxxxxxxx>; Baolin Wang
> <baolin.wang@xxxxxxxxxxxxxxxxx>; Suren Baghdasaryan
> <surenb@xxxxxxxxxx>; Meta kernel team <kernel-team@xxxxxxxx>;
> bpf@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory
> Reclaim (reclaim_ext)
>
> The Problem
> -----------
>
> Memory reclaim in the kernel is a mess. We ship two completely separate
> eviction algorithms -- traditional LRU and MGLRU -- in the same file.
> mm/vmscan.c is over 8,000 lines. 40% of it is MGLRU-specific code that
> duplicates functionality already present in the traditional path. Every bug fix,
> every optimization, every feature has to be done twice or it only works for
> half the users. This is not sustainable. It has to stop.
>
> We should unify both algorithms into a single code path, with each
> algorithm reduced to a set of hooks called from it. Everyone maintains,
> understands, and evolves a single codebase. Optimizations are now
> evaluated against -- and available to -- both algorithms. And the next time
> someone develops a new LRU algorithm, they can do so in a way that does
> not add churn to existing code.
>
> How We Got Here
> ---------------
>
> MGLRU brought interesting ideas -- multi-generation aging, page table
> scanning, Bloom filters, spatial lookaround. But we never tried to refactor the
> existing reclaim code or integrate these mechanisms into the traditional path.
> 3,300 lines of code were dumped as a completely parallel implementation
> with a runtime toggle to switch between the two.
> No attempt to evolve the existing code or share mechanisms between the
> two paths -- just a second reclaim system bolted on next to the first.
>
> To be fair, traditional reclaim is not easy to refactor. It has accumulated
> decades of heuristics trying to work for every workload, and touching any of
> it risks regressions. But difficulty is not an excuse.
> There was no justification for not even trying -- not attempting to generalize
> the existing scanning path, not proposing shared abstractions, not offering
> the new mechanisms as improvements to the code that was already there.
> Hard does not mean impossible, and the cost of not trying is what we are
> living with now.
>
> The Differences That Matter
> ---------------------------
>
> The two algorithms differ in how they classify pages, detect access, and
> decide what to evict. But most of these differences are not fundamental
> -- they are mechanisms that got trapped inside one implementation when
> they could benefit both. Not making those mechanisms shareable leaves
> potential free performance gains on the table.
>
> Access detection: Traditional LRU walks reverse mappings (RMAP) from the
> page back to its page table entries. MGLRU walks page tables forward,
> scanning process address spaces directly. Neither approach is inherently tied
> to its eviction policy. Page table scanning would benefit traditional LRU just as
> much -- it is cache-friendly, batches updates without the LRU lock, and
> naturally exploits spatial locality. There is no reason this should be MGLRU-
> only.
>
> Bloom filters and lookaround: MGLRU uses Bloom filters to skip cold page
> table regions and a lookaround optimization to scan adjacent PTEs during
> eviction. These are general-purpose optimizations for any scanning path.
> They are locked inside MGLRU today for no good reason.
>
> Lock-free age updates: MGLRU updates folio age using atomic flag
> operations, avoiding the LRU lock during scanning. Traditional reclaim can use
> the same technique to reduce lock contention.
>
> Page classification: Traditional LRU uses two buckets (active/inactive).
> MGLRU uses four generations with timestamps and reference frequency
> tiers. This is the policy difference -- how many age buckets and how pages
> move between them. Every other mechanism is shareable.
>
> Both systems already share the core reclaim mechanics -- writeback,
> unmapping, swap, NUMA demotion, and working set tracking. The shareable
> mechanisms listed above should join that common core. What remains after
> that is a thin policy layer -- and that is all that should differ between
> algorithms.
>
> The Fix: One Reclaim, Pluggable and Extensible
> -----------------------------------------------
>
> We need one reclaim system, not two. One code path that everyone
> maintains, everyone tests, and everyone benefits from. But it needs to be
> pluggable: there will always be someone who needs customization for a
> specialized workload or wants to explore new techniques and ideas, and we
> do not want to end up in the current mess again.
>
> The unified reclaim must separate mechanism from policy. The mechanisms
> -- writeback, unmapping, swap, NUMA demotion, workingset tracking -- are
> shared today and should stay shared. The policy decisions -- how to detect
> access, how to classify pages, which pages to evict, when to protect a page --
> are where the two algorithms differ, and where future algorithms will differ
> too. Make those pluggable.
>
> This gives us one maintained code path with the flexibility to evolve.
> New ideas get implemented as new policies, not as 3,000-line forks. Good
> mechanisms from MGLRU (page table scanning, Bloom filters, lookaround)
> become shared infrastructure available to any policy. And if someone comes
> up with a better eviction algorithm tomorrow, they plug it in without
> touching the core.
>
> Making reclaim pluggable implies we define it as a set of methods
> (let's call them reclaim_ops) hooking into a stable codebase we rarely modify.
> We then have two big questions to answer: how do these reclaim ops look,
> and how do we move the existing code to the new model?
>
> How Do We Get There
> -------------------
>
> Do we merge the two mechanisms feature by feature, or do we prioritize
> moving MGLRU to the pluggable model then follow with LRU once we are
> happy with the result?
>
> Whichever option we choose, we do the work in small, self-contained phases.
> Each phase ships independently, each phase makes the code better, each
> phase is bisectable. No big bang. No disruption. No excuses.
>
> Option A: Factor and Merge
>
> MGLRU is already pretty modular. However, we do not know which
> optimizations are actually generic and which ones are only useful for MGLRU
> itself.
>
> Phase 1 -- Factor out just MGLRU into reclaim_ops. We make no functional
> changes to MGLRU. Traditional LRU code is left completely untouched at this
> stage.
>
> Phase 2 -- Merge the two paths one method at a time. Right now the code
> diverts control to MGLRU from the very top of the high-level hooks. We
> instead unify the algorithms starting from the very beginning of the LRU
> path, deciding what to keep in common code and what to move into
> traditional-LRU hooks.
>
> Advantages:
> - We do not touch LRU until Phase 2, avoiding churn.
> - Makes it easy to experiment with combining MGLRU features into
> traditional LRU. We do not actually know which optimizations are
> useful and which should stay in MGLRU hooks.
>
> Disadvantages:
> - We will not find out whether reclaim_ops exposes the right methods
> until we merge the paths at the end. We will have to change the ops
> if it turns out we need a different split. The reclaim_ops API will
> be private and have a single user so it is not that bad, but it may
> require additional changes.
>
> Option B: Merge and Factor
>
> Phase 1 -- Extract MGLRU mechanisms into shared infrastructure. Page table
> scanning, Bloom filter PMD skipping, lookaround, lock-free folio age updates.
> These are independently useful. Make them available to both algorithms.
> Stop hoarding good ideas inside one code path.
>
> Phase 2 -- Collapse the remaining differences. Generalize list infrastructure
> to N classifications (trad=2, MGLRU=4). Unify eviction entry points. Common
> classification/promotion interface. At this point the two "algorithms" are thin
> wrappers over shared code.
>
> Phase 3 -- Define the hook interface. Define reclaim_ops around the
> remaining policy differences. Layer BPF on top (reclaim_ext).
> Traditional LRU and MGLRU become two instances of the same interface.
> Adding a third algorithm means writing a new set of hooks, not forking
> 3,000 lines.
>
> Advantages:
> - We get signals earlier on what should be shared. We know every shared
> mechanism is useful because both algorithms use it.
> - Can test LRU optimizations on MGLRU early.
>
> Disadvantages:
> - Slower, as we factor out both algorithms and expand reclaim_ops all
> at once.
>
> Open Questions
> --------------
>
> - Policy granularity: system-wide, per-node, or per-cgroup?
> - Mechanism/policy boundary: needs iteration; get it wrong and we
> either constrain policies or duplicate code.
> - Validation: reclaim quality is hard to measure; we need agreed-upon
> benchmarks.
> - Simplicity: the end result must be simpler than what we have today,
> not more complex. If it is not simpler, we failed.
> --
> 2.52.0
>

Hi Shakeel,

The reclaim_ops direction looks very promising, and I'd be interested in joining the discussion.

We are particularly interested in the individual effects of several mechanisms
currently bundled in MGLRU. reclaim_ops would provide a great opportunity to
run ablation experiments, e.g. testing traditional LRU with page table scanning.

On policy granularity, it would also be interesting to see something like "reclaim_ext" [1,2]
take control at different levels, similar to what sched_ext does for scheduling policies.

Best,
Zicheng

[1] cache_ext: Customizing the Page Cache with eBPF
[2] PageFlex: Flexible and Efficient User-space Delegation of Linux Paging Policies with eBPF