Re: [RFC PATCH 0/1] mm: Reduce direct reclaim stalls with RAM-backed swap
From: Matt Fleming
Date: Wed Mar 04 2026 - 10:51:50 EST
On Tue, Mar 03, 2026 at 02:35:12PM -0500, Johannes Weiner wrote:
>
> What about when anon pages *are* reclaimable through compression,
> though? Then we'd declare OOM prematurely.
I agree this RFC is a rather blunt approach, which is why I tried to
limit it to zram/brd specifically.
> You could make the case that what is reclaimable should have been
> reclaimed already by the time we get here. But then you could make the
> same case for file pages, and then there is nothing left.
>
> The check is meant to be an optimization. The primary OOM cutoff is
> that we aren't able to reclaim anything. This reclaimable check is a
> shortcut that says, even if we are reclaiming some, there is not
> enough juice in that box to keep squeezing.
>
> Have you looked at what exactly keeps resetting no_progress_loops when
> the system is in this state?
I pulled data from some of the worst offenders, but I couldn't catch
any of them in the 20-30 minute brownout state itself. Still, I think
the data illustrates the problem...
Across three machines, every reclaim_retry_zone event showed
no_progress_loops = 0 and wmark_check = pass. On the busiest node (141
retry events over 5 minutes), the reclaimable estimate ranged from 4.8M
to 5.3M pages (19-21 GiB). The counter never incremented.
The reclaimable watermark check also always passes. The traced
reclaimable values (19-21 GiB per zone) trivially exceed the min
watermark (~68 MiB), so should_reclaim_retry() never falls through on
that path either.
Sample output from a bpftrace script [1] on the reclaim_retry_zone
tracepoint (LOOPS = no_progress_loops, WMARK = wmark_check):
COMM  PID      NODE  ORDER  RECLAIMABLE  AVAILABLE  MIN_WMARK  LOOPS  WMARK
app1  2133536     4      0      4960156    5013010      17522      0      1
app2  2337869     5      0      4845655    4901543      17521      0      1
app3   339457     6      0      4823519    4838900      17522      0      1
app4  2179800     6      0      4819201    4835085      17522      0      1
app5  2299092     0      0      3566433    3595953      15821      0      1
app6  2194373     7      0      5612347    5626651      17521      0      1
Here are the numbers from a 5-minute bpftrace session on a node under
memory pressure:
should_reclaim_retry:
141 calls, no_progress_loops = 0 every time, wmark_check = pass every time
reclaimable estimate: 4.8M - 5.3M pages (19-21 GiB)
shrink_folio_list (mm_vmscan_lru_shrink_inactive) [2]:
anon: 52M pages reclaimed / 244M scanned (21% hit rate)
53% of scan events reclaimed zero pages
file: 33M pages reclaimed / 42M scanned (78% hit rate)
21% of scan events reclaimed zero pages
priority distribution peaked at 2-3 (most aggressive levels)
[1] https://gist.github.com/mfleming/167b00bef7e1f4e686a6d32833c42079
[2] https://gist.github.com/mfleming/e31c86d3ab0a883e9053e19010150a13
A second node showed the same pattern: 18% anon scan efficiency vs 90%
file, no_progress_loops = 0, wmark always passes.
> I could see an argument that the two checks are not properly aligned
> right now. We could be making nominal forward progress on a small,
> heavily thrashing cache position only; but we'll keep looping because,
> well, look at all this anon memory! (Which isn't being reclaimed.)
>
> If that's the case, a better solution might be to split
> did_some_progress into anon and file progress, and only consider the
> LRU pages for which reclaim is actually making headway. And ignore
> those where we fail to succeed - for whatever reason, really, not just
> this particular zram situation.
Right. The mm_vmscan_lru_shrink_inactive tracepoint shows the anon LRU
being scanned aggressively at priority 1-3, but only 21% of scanned
pages are reclaimed. Meanwhile file reclaim runs at 78-90% efficiency
but there aren't enough file pages to satisfy the allocation.
> And if that isn't enough, maybe pass did_some_progress as the actual
> page counts instead of a bool, and only consider an LRU type
> reclaimable if the last scan cycle reclaimed at least N% of it.
Nice idea. I'll work on a patch.
Thanks,
Matt