Re: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext)
From: Tal Zussman
Date: Fri Mar 27 2026 - 15:14:16 EST
On 3/26/26 11:43 PM, Matthew Wilcox wrote:
> On Thu, Mar 26, 2026 at 01:47:43PM -0700, Axel Rasmussen wrote:
>> On Thu, Mar 26, 2026 at 1:30 PM Gregory Price <gourry@xxxxxxxxxx> wrote:
>> >
>> > On Thu, Mar 26, 2026 at 01:02:02PM -0700, Axel Rasmussen wrote:
>> > >
>> > > I think one thing we all agree on at least is, long term, there isn't
>> > > really a good argument for having > 1 LRU implementation. E.g., we
>> > > don't believe there are just irreconcilable differences, where one
>> > > impl is better for some workloads, and another is better for others,
>> > > and there is no way the two can be converged.
>> > >
>> >
>> > I absolutely believe there are irreconcilable differences - but not in
>> > the sense that one is better or worse, but in the sense that features
>> > from one cannot work in the other.
>>
>> Right, agreed. I mean a case where we have workloads A and B, such
>> that there does not exist an implementation that can serve both well.
>> If such workloads were "common" to me that would justify a reclaim_ops
>> / pluggable abstraction layer. My thesis is that they are "not
>> common", so I'm a bit skeptical the abstraction is worth it.
>
> That isn't what Tal was telling me at Plumbers. Adding him to cc so
> he can dispute you in his own words, rather than my clumsy paraphrasing
> of what he said.
>
Yeah, unfortunately it's not so straightforward. As a simple illustrative
example, consider a file-search workload, where you search through a large
number of files over and over again (e.g., a poor kernel developer trying to
understand how the page cache works). This access pattern favors MRU over LRU
eviction, and readahead doesn't help much, so the active/inactive and MGLRU
policies perform similarly (~40s runtime in a specific benchmark we ran). In
comparison, running an MRU policy with cache_ext (our eBPF-based caching
framework) cuts the runtime to ~20s.

This is also true for more complex workloads, like an HTAP-like database
workload, where we have lots of small GET-like requests and a few large SCAN
requests. We find that the SCAN requests pollute the cache for both
policies, leading to eviction of the (small) GET data and degrading
performance. fadvise() doesn't help much, but using a custom policy that
separates the GET and SCAN data into two different queues can yield a 70%
throughput increase for GET requests. This was tested on Linux v6.6, so some
MGLRU behaviors may have improved since then, but the underlying structural
limitation remains.

It's been well-known in the academic realm for a while that there isn't
really a "one-size-fits-all" policy that works *best* for all workloads.
Yes, you can make a general policy that works *well*, but if you really care
about a workload's performance and want to squeeze out the last 10-20% (or
more) of performance, you need to be able to (1) experiment and (2) take
advantage of application-level insights. Being able to extend reclaim (in
our case with eBPF) enables that.

We wrote a paper about this that was published a few months ago [1]. Happy
to answer any questions and continue the discussion!

Thanks,
Tal

[1] https://dl.acm.org/doi/pdf/10.1145/3731569.3764820