Re: [PATCH] Revert "mm:vmscan: fix inaccurate reclaim during proactive reclaim"
From: T.J. Mercier
Date: Wed Jan 24 2024 - 13:14:42 EST
On Tue, Jan 23, 2024 at 8:48 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> The revert isn't a straightforward solution.
>
> The patch you're reverting fixed conventional reclaim and broke
> MGLRU. Your revert fixes MGLRU and breaks conventional reclaim.
>
> On Tue, Jan 23, 2024 at 05:58:05AM -0800, T.J. Mercier wrote:
> > They both are able to make progress. The main difference is that a
> > single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> > after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> > a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> > pages with MGLRU. Without MGLRU the memcg walk is not aborted
> > immediately after nr_to_reclaim is reached, so a single call to
> > try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> > even when sc->nr_to_reclaim is 32. (I.e., MGLRU overreclaims less.)
> > https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@xxxxxxxxxx/
>
> Is that a feature or a bug?

Feature!

> * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
> * of their max_seq counters ensures the eventual fairness to all eligible
> * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
>
> If it bails out exactly after nr_to_reclaim, it'll overreclaim
> less. But with steady reclaim in a complex subtree, it will always hit
> the first cgroup returned by mem_cgroup_iter() and then bail. This
> seems like a fairness issue.

Right. Because the memcg LRU is maintained in pg_data_t and not in
each cgroup, I think we are currently forced to iterate across all
child memcgs during non-root memcg reclaim to keep it fair.

> We should figure out what the right method for balancing fairness with
> overreclaim is, regardless of reclaim implementation. Because having
> two different approaches and reverting dependent things back and forth
> doesn't make sense.
>
> Using an LRU to rotate through memcgs over multiple reclaim cycles
> seems like a good idea. Why is this specific to MGLRU? Shouldn't this
> be a generic piece of memcg infrastructure?

It would be pretty sweet if it were. I haven't tried to measure this
part in isolation, but I know we had to abandon attempts to use
per-app memcgs in the past (2018?) because the perf overhead was too
much. In recent tests where this feature is used, I see some perf
gains which I think are probably attributable to this.

> Then there is the question of why there is an LRU for global reclaim,
> but not for subtree reclaim. Reclaiming a container with multiple
> subtrees would benefit from the fairness provided by a container-level
> LRU order just as much; having fairness for root but not for subtrees
> would produce different reclaim and pressure behavior, and can cause
> regressions when moving a service from bare-metal into a container.
>
> Figuring out these differences and converging on a method for cgroup
> fairness would be the better way of fixing this. Because of the
> regression risk to the default reclaim implementation, I'm inclined to
> NAK this revert.

In the meantime, instead of a revert, how about shrinking the batch
size geometrically rather than capping it at the SWAP_CLUSTER_MAX
constant:

         reclaimed = try_to_free_mem_cgroup_pages(memcg,
-                        min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
+                        (nr_to_reclaim - nr_reclaimed)/2,
                         GFP_KERNEL, reclaim_options);

I think that should address the overreclaim concern (it was mentioned
that the upper bound of overreclaim was 2 * request), and this should
also increase the reclaim rate for root reclaim with MGLRU closer to
what it was before.
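
To illustrate the effect of the halving batch on the number of reclaim
calls, here is a rough user-space model (not kernel code). The names
model_reclaim() and count_reclaim_calls() are made up for this sketch.
It assumes each call rounds the request up to SWAP_CLUSTER_MAX and then
reclaims exactly that many pages, which real reclaim does not guarantee
(it can return more or fewer).

```c
#include <assert.h>

/* Same value the kernel uses; redefined here because this is user-space code. */
#define SWAP_CLUSTER_MAX 32UL

/*
 * Hypothetical stand-in for try_to_free_mem_cgroup_pages().
 * ASSUMPTION: the request is rounded up to SWAP_CLUSTER_MAX and exactly
 * that many pages are reclaimed. Real reclaim can return more or fewer.
 */
static unsigned long model_reclaim(unsigned long nr_pages)
{
	return nr_pages > SWAP_CLUSTER_MAX ? nr_pages : SWAP_CLUSTER_MAX;
}

/*
 * Count how many model_reclaim() calls are needed to reach nr_to_reclaim.
 *   geometric == 0: batch = min(remaining, SWAP_CLUSTER_MAX)
 *   geometric != 0: batch = remaining / 2
 */
static unsigned int count_reclaim_calls(unsigned long nr_to_reclaim,
					int geometric)
{
	unsigned long nr_reclaimed = 0;
	unsigned int calls = 0;

	assert(nr_to_reclaim > 0);

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long remaining = nr_to_reclaim - nr_reclaimed;
		unsigned long batch;

		if (geometric)
			batch = remaining / 2;
		else
			batch = remaining < SWAP_CLUSTER_MAX ?
				remaining : SWAP_CLUSTER_MAX;

		/* Every call reclaims at least SWAP_CLUSTER_MAX, so this ends. */
		nr_reclaimed += model_reclaim(batch);
		calls++;
	}

	return calls;
}
```

Under those assumptions, a 1024-page request takes 32 calls with the
constant batch and 6 calls with the halving batch (512, 256, 128, 64,
32, then a final 32 after rounding up).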