Re: [PATCH] zswap: do not shrink when memory.zswap.max is 0

From: Johannes Weiner
Date: Tue May 30 2023 - 16:59:57 EST


On Tue, May 30, 2023 at 01:19:12PM -0700, Yosry Ahmed wrote:
> On Tue, May 30, 2023 at 12:13 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Tue, May 30, 2023 at 11:41:32AM -0700, Yosry Ahmed wrote:
> > > On Tue, May 30, 2023 at 11:00 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > >
> > > > On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote:
> > > > > On Tue, May 30, 2023 at 9:22 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Before storing a page, zswap first checks if the number of stored pages
> > > > > > exceeds the limit specified by memory.zswap.max, for each cgroup in the
> > > > > > hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> > > > > > triggered and short-circuits the store attempt.
> > > > > >
> > > > > > However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
> > > > > > will allow future store attempts from processes in this cgroup to
> > > > > > succeed. Furthermore, this create a pathological behavior in a system
> > > > > > where some cgroups have memory.zswap.max = 0 and some do not: the
> > > > > > processes in the former cgroups, under memory pressure, will evict pages
> > > > > > stored by the latter continually, until the need for swap ceases or the
> > > > > > pool becomes empty.
> > > > > >
> > > > > > As a result of this, we observe a disproportionate amount of zswap
> > > > > > writeback and a perpetually small zswap pool in our experiments, even
> > > > > > though the pool limit is never hit.
> > > > > >
> > > > > > This patch fixes the issue by rejecting zswap store attempt without
> > > > > > shrinking the pool when memory.zswap.max is 0.
> > > > > >
> > > > > > Fixes: f4840ccfca25 ("zswap: memcg accounting")
> > > > > > Signed-off-by: Nhat Pham <nphamcs@xxxxxxxxx>
> > > > > > ---
> > > > > > include/linux/memcontrol.h | 6 +++---
> > > > > > mm/memcontrol.c | 8 ++++----
> > > > > > mm/zswap.c | 9 +++++++--
> > > > > > 3 files changed, 14 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > > index 222d7370134c..507bed3a28b0 100644
> > > > > > --- a/include/linux/memcontrol.h
> > > > > > +++ b/include/linux/memcontrol.h
> > > > > > @@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> > > > > > #endif /* CONFIG_MEMCG_KMEM */
> > > > > >
> > > > > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > > void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > > void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > > #else
> > > > > > -static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > +static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > {
> > > > > > - return true;
> > > > > > + return 0;
> > > > > > }
> > > > > > static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
> > > > > > size_t size)
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index 4b27e245a055..09aad0e6f2ea 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
> > > > > > * spending cycles on compression when there is already no room left
> > > > > > * or zswap is disabled altogether somewhere in the hierarchy.
> > > > > > */
> > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > {
> > > > > > struct mem_cgroup *memcg, *original_memcg;
> > > > > > - bool ret = true;
> > > > > > + int ret = 0;
> > > > > >
> > > > > > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > > > > return true;
> > > > > > @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > if (max == PAGE_COUNTER_MAX)
> > > > > > continue;
> > > > > > if (max == 0) {
> > > > > > - ret = false;
> > > > > > + ret = -ENODEV;
> > > > > > break;
> > > > > > }
> > > > > >
> > > > > > @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> > > > > > if (pages < max)
> > > > > > continue;
> > > > > > - ret = false;
> > > > > > + ret = -ENOMEM;
> > > > > > break;
> > > > > > }
> > > > > > mem_cgroup_put(original_memcg);
> > > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > > index 59da2a415fbb..7b13dc865438 100644
> > > > > > --- a/mm/zswap.c
> > > > > > +++ b/mm/zswap.c
> > > > > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > > > > }
> > > > > >
> > > > > > objcg = get_obj_cgroup_from_page(page);
> > > > > > - if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > > > - goto shrink;
> > > > > > + if (objcg) {
> > > > > > + ret = obj_cgroup_may_zswap(objcg);
> > > > > > + if (ret == -ENODEV)
> > > > > > + goto reject;
> > > > > > + if (ret == -ENOMEM)
> > > > > > + goto shrink;
> > > > > > + }
> > > > >
> > > > > I wonder if we should just make this:
> > > > >
> > > > > if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > > goto reject;
> > > > >
> > > > > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > > > > zswap pool will only help if we happen to writeback a page from the
> > > > > same memcg that hit its limit. Keep in mind that we will only
> > > > > writeback one page every time we observe that the limit is hit (even
> > > > > with Domenico's patch, because zswap_can_accept() should be true).
> > > > >
> > > > > On a system with a handful of memcgs,
> > > > > it seems likely that we wrongfully writeback pages from other memcgs
> > > > > because of this. Achieving nothing for this memcg, while hurting
> > > > > others. OTOH, without invoking writeback when the limit is hit, the
> > > > > memcg will just not be able to use zswap until some pages are
> > > > > faulted back in or invalidated.
> > > > >
> > > > > I am not sure which is better, just thinking out loud.
> > > >
> > > > You're absolutely right.
> > > >
> > > > Currently the choice is writing back either everybody or nobody,
> > > > meaning between writeback and cgroup containment. They're both so poor
> > > > that I can't say I strongly prefer one over the other.
> > > >
> > > > However, I have a lame argument in favor of this patch:
> > > >
> > > > The last few fixes from Nhat and Domenico around writeback show that
> > > > few people, if anybody, are actually using writeback. So it might not
> > > > actually matter that much in practice which way we go with this patch.
> > > > Per-memcg LRUs will be necessary for it to work right.
> > > >
> > > > However, what Nhat is proposing is how we want the behavior down the
> > > > line. So between two equally poor choices, I figure we might as well
> > > > go with the one that doesn't require another code change later on.
> > > >
> > > > Doesn't that fill you with radiant enthusiasm?
> > >
> > > If we have per-memcg LRUs, and memory.zswap.max == 0, then we should
> > > be in one of two situations:
> > >
> > > (a) memory.zswap.max has always been 0, so the LRU for this memcg is
> > > empty, so we don't really need the special case for memory.zswap.max
> > > == 0.
> > >
> > > (b) memory.zswap.max was reduced to 0 at some point, and some pages
> > > are already in zswap. In this case, I don't think shrinking the memcg
> > > is such a bad idea, we would be lazily enforcing the limit.
> > >
> > > In that sense I am not sure that this change won't require another
> > > code change. It feels like special casing memory.zswap.max == 0 is
> > > only needed now due to the lack of per-memcg LRUs.
> >
> > Good point. And I agree down the line we should just always send the
> > shrinker off optimistically on the cgroup's lru list.
> >
> > So I take back my lame argument. But that then still leaves us with
> > the situation that both choices are equal here, right?
> >
> > If so, my vote would be to go with the patch as-is.
>
> I *think* it's better to punish the memcg that exceeded its limit by
> not allowing it to use zswap until its usage goes down, rather than
> punish random memcgs on the machine because one memcg hit its limit.
> It also seems to me that on a system with a handful of memcgs, it is
> statistically more likely for zswap shrinking to writeback a page from
> the wrong memcg.

Right, but in either case a hybrid zswap + swap setup with cgroup
isolation is broken anyway. Without it being usable, I'm assuming
there are no users - maybe that's optimistic of me ;)

However, if you think it's better to just be conservative about taking
action in general, that's fine by me as well.

> The code would also be simpler if obj_cgroup_may_zswap() just returns
> a boolean and we do not shrink at all if it returns false. If it no
> longer returns a boolean we should at least rename it.
>
> Did you try just not shrinking at all if the memcg limit is hit in
> your experiments?
>
> I don't feel strongly, but my preference would be to just not shrink
> at all if obj_cgroup_may_zswap() returns false.

Sounds reasonable to me. Basically just replace the goto shrink with
goto reject for now. Maybe a comment that says "XXX: Writeback/reclaim
does not work with cgroups yet. Needs a cgroup-aware entry LRU first,
or we'd push out entries system-wide based on local cgroup limits."

Nhat, does that sound good to you?