Re: [PATCH] mm: Require LRU reclaim progress before retrying direct reclaim

From: Vlastimil Babka (SUSE)

Date: Mon Apr 20 2026 - 05:13:50 EST


On 4/15/26 11:11, Matt Fleming wrote:
> On Mon, Apr 13, 2026 at 05:38:19PM +0200, Vlastimil Babka (SUSE) wrote:
>>
>> Hi Matt,
>>
>> so have you tested it for your usecase with zram and have any observations
>> how it helped, what values did you set etc?
>
> Hey Vlastimil,
>
> Yeah I've tested this out. So far, results have been positive -- I see
> system-wide OOM kills when memory is low and direct reclaim occurs, but
> not so many OOM kills that the SRE folks have started screaming at me.

Hmm...

> I've only run with the proposed 1% value so far. I also ran a bunch of
> benchmarks alongside a memory hogging app that periodically touches
> anonymous memory.
>
> Workload                    rpp=0                rpp=1                Notes
> ---------------------------------------------------------------------------------------------
> Kernel compile + anon hog   Completed, no OOM    Completed,           Global OOM confirmed from
>                                                  Global OOM fired     __alloc_pages_slowpath

Completed in both cases... but was it faster? Also what got OOM killed, the hog?

>
> Memcached + anon hog        282k / 2.30M ops/s   562k / 3.53M ops/s   Global OOM killed hog,
>                             No OOM               Global OOM fired     then benchmark ran faster

The improvement is nice. However, even in the rpp=0 case there didn't seem
to be thrashing so bad that the system couldn't recover.

I think this is minimally an argument against having it enabled by default:
by default we don't want to cause premature OOMs while the system is still
making progress (and yes, we do have problems recognizing when it's not, and
actually triggering the OOM killer). Trading the killing of one workload for
better throughput on another is a good fit for certain kinds of
servers/workloads, but not as a default.

And once you go that way, you might be better off looking at the PSI
metrics, which would be more holistic than this heuristic?

> Pure fio (5 reruns each)    median 3710 MiB/s    median 3702 MiB/s    No reproducible regression
> Mixed fio + anon hog        2747 MiB/s           2915 MiB/s           Global OOM killed
>                                                                       unrelated services
>
> reclaim_progress_pct=1 seems to help in these memory exhausted
> situations, and doesn't appear to cause a regression for the pure file
> workload case.
>
> If you have any suggestions for other tests or benchmarks to run I'd be
> happy to do that.
>
> Thanks,
> Matt