[RFC PATCH] mm: memcg: Do not overreclaim SWAP_CLUSTER_MAX from protected memcg

From: Michal Koutný
Date: Tue May 10 2022 - 12:48:31 EST


This was observed with memcontrol selftest/new LTP test but can be also
reproduced in simplified setup of two siblings:

`parent .low=50M
` s1 .low=50M .current=50M+ε
` s2 .low=0M .current=50M

The expectation is that s2/memory.events:low will be zero under outer
reclaimer since no protection should be given to cgroup s2 (even with
memory_recursiveprot).

However, this does not happen. The apparent reason is that when s1 is
considered for (proportional) reclaim the scanned proportion is rounded
up to SWAP_CLUSTER_MAX and slightly over-proportional amount is
reclaimed. Consequently, when the effective low value of s2 is
calculated, it observes unclaimed parent's protection from s1
(ε-SWAP_CLUSTER_MAX in theory) and effectively appropriates it.

What is worse, when the sibling s2 has more (memory) greedy workload, it
can repeatedly "steal" the protection from s1 and the distribution ends
up with s1 mostly reclaimed despite explicit prioritization over s2.

Simply fix it by _not_ rounding up to SWAP_CLUSTER_MAX. This would have
saved us ~5 levels of reclaim priority. I.e. we may be reclaiming from
protected memcgs at relatively low priority _without_ counting any
memory.events:low (due to overreclaim). Now, if the moderated scan is
not enough, we must bring priority to zero to open protected reserves.
And that's correct, we want to be explicit when reclaiming those.


Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Reported-by: Richard Palethorpe <rpalethorpe@xxxxxxxx>
Link: https://lore.kernel.org/all/20220321101429.3703-1-rpalethorpe@xxxxxxxx/
Signed-off-by: Michal Koutný <mkoutny@xxxxxxxx>
---
mm/vmscan.c | 7 -------
1 file changed, 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1678802e03e7..cd760842b9ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2798,13 +2798,6 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,

scan = lruvec_size - lruvec_size * protection /
(cgroup_size + 1);
-
- /*
- * Minimally target SWAP_CLUSTER_MAX pages to keep
- * reclaim moving forwards, avoiding decrementing
- * sc->priority further than desirable.
- */
- scan = max(scan, SWAP_CLUSTER_MAX);
} else {
scan = lruvec_size;
}
--
2.35.3