Re: [PATCH 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker

From: Yosry Ahmed
Date: Tue Jul 30 2024 - 14:47:07 EST


On Mon, Jul 29, 2024 at 8:39 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Fri, Jul 26, 2024 at 02:58:14PM -0700, Yosry Ahmed wrote:
> > On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > >
> > > Current zswap shrinker's heursitics to prevent overshrinking is brittle
> > > and inaccurate, specifically in the way we decay the protection size
> > > (i.e making pages in the zswap LRU eligible for reclaim).
> >
> > Thanks for working on this and experimenting with different
> > heuristics. I was not a huge fan of these, so I am glad we are trying
> > to replace them with something more intuitive.
> >
> > >
> > > We currently decay protection aggressively in zswap_lru_add() calls.
> > > This leads to the following unfortunate effect: when a new batch of
> > > pages enter zswap, the protection size rapidly decays to below 25% of
> > > the zswap LRU size, which is way too low.
> > >
> > > We have observed this effect in production, when experimenting with the
> > > zswap shrinker: the rate of shrinking shoots up massively right after a
> > > new batch of zswap stores. This is somewhat the opposite of what we want
> > > originally - when new pages enter zswap, we want to protect both these
> > > new pages AND the pages that are already protected in the zswap LRU.
> > >
> > > Replace existing heuristics with a second chance algorithm
> > >
> > > 1. When a new zswap entry is stored in the zswap pool, its reference bit
> > > is set.
> > > 2. When the zswap shrinker encounters a zswap entry with the reference
> > > bit set, give it a second chance - only flips the reference bit and
> > > rotate it in the LRU.
> > > 3. If the shrinker encounters the entry again, this time with its
> > > reference bit unset, then it can reclaim the entry.
> >
> > At the first look, this is similar to the reclaim algorithm. A
> > fundamental difference here is that the reference bit is only set
> > once, when the entry is created. It is different from the conventional
> > second chance page reclaim/replacement algorithm.
> >
> > What this really does, is that it slows down writeback by enforcing
> > that we need to iterate entries exactly twice before we write them
> > back. This sounds a little arbitrary and not very intuitive to me.
>
> This isn't different than other second chance algorithms. Those
> usually set the reference bit again to buy the entry another round. In
> our case, another reference causes a zswapin, which removes the entry
> from the list - buying it another round. Entries will get reclaimed
> once the scan rate catches up with the longest reuse distance.
>
> The main goal, which was also the goal of the protection math, is to
> slow down writebacks in proportion to new entries showing up. This
> gives zswap a chance to solve memory pressure through compression. If
> memory pressure persists, writeback should pick up.
>
> If no new entries were to show up, then sure, this would be busy
> work. In practice, new entries do show up at a varying rate. This all
> happens in parallel to anon reclaim, after all. The key here is that
> new entries will be interleaved with rotated entries, and they consume
> scan work! This is what results in the proportional slowdown.

The fact that in practice there will be new entries showing up because
we are doing anon reclaim is exactly what I had in mind. My point is
that in practice, for most cases, the reference bit could just be
causing a steady slowdown, which can be achieved just by tuning the
seek. But I have a better idea now about what you mean with the
proportional slowdown.

(more below)

>
> > Taking a step back, what we really want is to writeback zswap entries
> > in order, and avoid writing back more entries than needed. I think the
> > key here is "when needed", which is defined by how much memory
> > pressure we have. The shrinker framework should already be taking this
> > into account.
> >
> > Looking at do_shrink_slab(), in the case of zswap (seek = 2),
> > total_scan should boil down to:
> >
> > total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
> >
> > , and this is bounded by zswap_shrinker_count() * 2.
> >
> > Ignoring nr_deferred, we start by scanning 1/2048th of
> > zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
> > zswap_shrinker_count() at zero priority (before OOMs). At
> > NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
> >
> > Keep in mind that zswap_shrinker_count() does not return the number of
> > all zswap entries, it subtracts the protected part (or recent swapins)
> > and scales by the compression ratio. So this looks reasonable at first
> > sight, perhaps we want to tune the seek to slow down writeback if we
> > think it's too much, but that doesn't explain the scenario you are
> > describing.
> >
> > Now let's factor in nr_deferred, which looks to me like it could be
> > the culprit here. I am assuming the intention is that if we counted
> > freeable slab objects before but didn't get to free them, we should do
> > it the next time around. This feels like it assumes that the objects
> > will remain there unless reclaimed by the shrinker. This does not
> > apply for zswap, because the objects can be swapped in.
>
> Hm.
>
> _count() returns (objects - protected) * compression_rate, then the
> shrinker does the >> priority dance. So to_scan is expected to be a
> small portion of unprotected objects.
>
> _scan() bails if to_scan > (objects - protected).
>
> How often does this actually abort in practice?

Good question :)

>
> > Also, in the beginning, before we encounter too many swapins, the
> > protection will be very low, so zswap_shrinker_count() will return a
> > relatively high value. Even if we don't scan and writeback this
> > amount, we will keep carrying this value forward in next reclaim
> > operations, even if the number of existing zswap entries have
> > decreased due to swapins.
> >
> > Could this be the problem? The number of deferred objects to be
> > scanned just keeps going forward as a high value, essentially
> > rendering the heuristics in zswap_shrinker_count() useless?
> >
> > If we just need to slow down writeback by making sure we scan entries
> > twice, could something similar be achieved just by tuning the seek
> > without needing any heuristics to begin with?
>
> Seek is a fixed coefficient for the scan rate.
>
> We want to slow writeback when recent zswapouts dominate the zswap
> pool (expanding or thrashing), and speed it up when recent entries
> make up a small share of the pool (stagnating).
>
> This is what the second chance accomplishes.

Yeah I am more convinced now, I wanted to check if just tuning the
seek achieves a similar result, but the complexity from the reference
bit is not much anyway -- so maybe it isn't worth it.

I still think the nr_deferred handling in the shrinker framework does
not seem compatible with zswap. Entries could be double counted, and
previously counted entries could go away during swapins. Unless I am
missing something, it seems like the scanning can be arbitrary, and
not truly proportional to the reclaim priority and zswap LRUs size.

I also think it's important to clarify that there are two mechanisms
that control the writeback rate with this patch, the reference bit
proportional slowdown (protecting against writeback of recently
swapped out pages), and nr_swapins protection (feedback from swapins
of recently written back pages).

Maybe we can move things around to make it more obvious how these
mechanisms work hand in hand, or have a comment somewhere describing
the writeback mechanism.