Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap

In reply to: Jens Axboe: "Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap"

From: Johannes Weiner

Date: Tue Mar 03 2026 - 14:40:38 EST

On Tue, Mar 03, 2026 at 11:53:57AM +0000, Matt Fleming wrote:
> When all active swap devices are RAM-backed, should_reclaim_retry()
> excludes anonymous pages from the reclaimable estimate and counts
> only file-backed pages. Once file pages are exhausted the watermark
> check fails and the kernel falls through to OOM.

What about when anon pages *are* reclaimable through compression,
though? Then we'd declare OOM prematurely.

You could make the case that what is reclaimable should have been
reclaimed already by the time we get here. But then you could make the
same case for file pages, and then there is nothing left.

The check is meant to be an optimization. The primary OOM cutoff is
that we aren't able to reclaim anything. This reclaimable check is a
shortcut that says, even if we are reclaiming some, there is not
enough juice in that box to keep squeezing.
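
To make the relationship between the two checks concrete, here is a rough userspace model, not the actual code in mm/page_alloc.c. MAX_RECLAIM_RETRIES and its value of 16 match the kernel; struct zone_model and model_should_retry() are made-up names, and per-zone iteration, allocation order and lowmem reserves are all ignored.

```c
#include <stdbool.h>

/* Same name and value as the kernel's retry cap. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical, flattened stand-in for the state of a single zone. */
struct zone_model {
	unsigned long free_pages;
	unsigned long reclaimable_pages;	/* LRU pages assumed reclaimable */
	unsigned long min_watermark;
};

/* Simplified model of the retry decision; returns true to keep looping. */
static bool model_should_retry(const struct zone_model *z,
			       bool did_some_progress,
			       int *no_progress_loops)
{
	/* Primary cutoff: consecutive rounds that reclaimed nothing. */
	if (did_some_progress)
		*no_progress_loops = 0;
	else
		(*no_progress_loops)++;

	if (*no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;

	/*
	 * Shortcut: even with some progress, give up if reclaiming
	 * everything we think is reclaimable could not lift the zone
	 * above its watermark.
	 */
	return z->free_pages + z->reclaimable_pages > z->min_watermark;
}
```

In this model any round with did_some_progress set resets the counter, so the primary cutoff can only fire if the shortcut estimate is honest about what is left to reclaim.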

Have you looked at what exactly keeps resetting no_progress_loops when
the system is in this state?

I could see an argument that the two checks are not properly aligned
right now. We could be making nominal forward progress on a small,
heavily thrashing cache position only; but we'll keep looping because,
well, look at all this anon memory! (Which isn't being reclaimed.)

If that's the case, a better solution might be to split
did_some_progress into anon and file progress, and only consider the
LRU pages for which reclaim is actually making headway. And ignore
those where we fail to succeed - for whatever reason, really, not just
this particular zram situation.
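
A minimal sketch of that split, again as a standalone model rather than a patch. struct reclaim_progress, struct lru_estimate and reclaimable_with_headway() are hypothetical names; nothing like them exists in the tree today.

```c
#include <stdbool.h>

/* Hypothetical replacement for the single did_some_progress flag. */
struct reclaim_progress {
	bool anon;	/* last reclaim round freed anon pages */
	bool file;	/* last reclaim round freed file pages */
};

/* Hypothetical per-type LRU sizes feeding the reclaimable estimate. */
struct lru_estimate {
	unsigned long anon;
	unsigned long file;
};

/*
 * Count an LRU type toward the reclaimable estimate only if reclaim
 * actually made headway on it in the last round, whatever the reason
 * for the lack of headway on the other type.
 */
static unsigned long
reclaimable_with_headway(const struct lru_estimate *lru,
			 const struct reclaim_progress *progress)
{
	unsigned long pages = 0;

	if (progress->anon)
		pages += lru->anon;
	if (progress->file)
		pages += lru->file;

	return pages;
}
```

With a large anon set that is not being reclaimed and a small thrashing cache, the estimate shrinks to the cache size alone, and the watermark check can fail instead of being propped up by anon.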

And if that isn't enough, maybe pass did_some_progress as the actual
page counts instead of a bool, and only consider an LRU type
reclaimable if the last scan cycle reclaimed at least N% of it.
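
The threshold variant could look roughly like this. struct lru_counts, lru_counts_as_reclaimable() and the min_pct parameter are invented for illustration, and the right value of N is an open question.

```c
#include <stdbool.h>

/* Hypothetical per-LRU-type numbers from the last scan cycle. */
struct lru_counts {
	unsigned long size;		/* pages on this LRU type */
	unsigned long reclaimed;	/* pages reclaimed from it last cycle */
};

/*
 * Treat an LRU type as reclaimable only if the last scan cycle
 * reclaimed at least min_pct percent of it. The comparison is
 * cross-multiplied to avoid division and rounding.
 */
static bool lru_counts_as_reclaimable(const struct lru_counts *lru,
				      unsigned int min_pct)
{
	if (!lru->size)
		return false;

	return (unsigned long long)lru->reclaimed * 100 >=
	       (unsigned long long)lru->size * min_pct;
}
```

For example, with min_pct = 1, a 1000-page LRU that gave up 10 pages still counts as reclaimable, while one that gave up 9 does not.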