Re: [PATCH 1/2] zswap: implement a second chance algorithm for dynamic zswap shrinker

From: Nhat Pham
Date: Mon Jul 29 2024 - 19:07:52 EST


On Fri, Jul 26, 2024 at 2:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Thu, Jul 25, 2024 at 4:28 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > Current zswap shrinker's heursitics to prevent overshrinking is brittle
> > and inaccurate, specifically in the way we decay the protection size
> > (i.e making pages in the zswap LRU eligible for reclaim).
>
> Thanks for working on this and experimenting with different
> heuristics. I was not a huge fan of these, so I am glad we are trying
> to replace them with something more intuitive.

That is certainly the intention :) I'm not a huge fan of those
heuristics either - they seem fairly brittle and arbitrary to me even
back then (as you have pointed out).

>
> >
> > We currently decay protection aggressively in zswap_lru_add() calls.
> > This leads to the following unfortunate effect: when a new batch of
> > pages enter zswap, the protection size rapidly decays to below 25% of
> > the zswap LRU size, which is way too low.
> >
> > We have observed this effect in production, when experimenting with the
> > zswap shrinker: the rate of shrinking shoots up massively right after a
> > new batch of zswap stores. This is somewhat the opposite of what we want
> > originally - when new pages enter zswap, we want to protect both these
> > new pages AND the pages that are already protected in the zswap LRU.
> >
> > Replace existing heuristics with a second chance algorithm
> >
> > 1. When a new zswap entry is stored in the zswap pool, its reference bit
> > is set.
> > 2. When the zswap shrinker encounters a zswap entry with the reference
> > bit set, give it a second chance - only flips the reference bit and
> > rotate it in the LRU.
> > 3. If the shrinker encounters the entry again, this time with its
> > reference bit unset, then it can reclaim the entry.
>
> At the first look, this is similar to the reclaim algorithm. A
> fundamental difference here is that the reference bit is only set
> once, when the entry is created. It is different from the conventional
> second chance page reclaim/replacement algorithm.

I suppose, yeah. We no longer have non-exclusive loading (in most
scenarios), so the reference bit is only set once, on store attempt.

The interpretation/justification is still there though (somewhat). We
are still giving each object another "chance" to stay in the zswap
pool, in case it is needed soon, before it is reclaimed.

> Taking a step back, what we really want is to writeback zswap entries
> in order, and avoid writing back more entries than needed. I think the
> key here is "when needed", which is defined by how much memory
> pressure we have. The shrinker framework should already be taking this
> into account.
>
> Looking at do_shrink_slab(), in the case of zswap (seek = 2),
> total_scan should boil down to:
>
> total_scan = (zswap_shrinker_count() * 2 + nr_deferred) >> priority
>
> , and this is bounded by zswap_shrinker_count() * 2.
>
> Ignoring nr_deferred, we start by scanning 1/2048th of
> zswap_shrinker_count() at DEF_PRIORITY, then we work our way to 2 *
> zswap_shrinker_count() at zero priority (before OOMs). At
> NODE_RECLAIM_PRIORITY, we start at 1/8th of zswap_shrinker_count().
>
> Keep in mind that zswap_shrinker_count() does not return the number of
> all zswap entries, it subtracts the protected part (or recent swapins)

Note that this "protection" decays rather aggressively with zswap
stores. From zswap_lru_add():

new = old > lru_size / 4 ? old / 2 : old;

IOW, if the protection size exceeds 25% lru_size, half it. A couple of
zswap_stores() could easily slash this to 25% of the LRU (or even
below) rapidly.

I guess I can fiddle with this decaying attempt + decouple the
decaying from storing (i.e moving it somewhere else). But any formula
I can come up with (another multiplicative factor?) seems as arbitrary
as this formula, and any places that I place the decaying could
potentially couple two actions, which lead to unintended effect.

The second chance algorithm creates a similar two-section LRU
configuration as the old scheme - protected/unprotected (or, analogous
to the page reclaim algorithm, active/inactive). However, it bypasses
the need for such an artificially constructed formula, which you can
think of as the rate of objects aging. It naturally ties the aging of
objects (i.e moving the objects from one section to another) in the
zswap pool with memory pressure, which makes sense to me, FWIW.

> and scales by the compression ratio. So this looks reasonable at first
> sight, perhaps we want to tune the seek to slow down writeback if we
> think it's too much, but that doesn't explain the scenario you are
> describing.
>
> Now let's factor in nr_deferred, which looks to me like it could be
> the culprit here. I am assuming the intention is that if we counted
> freeable slab objects before but didn't get to free them, we should do
> it the next time around. This feels like it assumes that the objects
> will remain there unless reclaimed by the shrinker. This does not
> apply for zswap, because the objects can be swapped in.
>
> Also, in the beginning, before we encounter too many swapins, the
> protection will be very low, so zswap_shrinker_count() will return a
> relatively high value. Even if we don't scan and writeback this
> amount, we will keep carrying this value forward in next reclaim
> operations, even if the number of existing zswap entries have
> decreased due to swapins.
>
> Could this be the problem? The number of deferred objects to be
> scanned just keeps going forward as a high value, essentially
> rendering the heuristics in zswap_shrinker_count() useless?

Interesting theory. I wonder if I can do some tracing to investigate
this (or maybe just spam trace_printk() lol).

But yeah, this nr_deferred part seems suspicious. The number of
freeable objects returned by the shrinker_count() function (at least
the one implemented by zswap) doesn't really track which object is
newly freeable since the last shrinker_count() invocation. So an
object can be accounted for in a previous invocation, fail to be
reclaimed in that invocation yet still count towards future shrinker
invocation? That sounds wrong :) Even without zswapin, that's still
double counting, no? Future shrinker_count() call will still consider
the previously-failed-to-reclaim object. Do other shrinkers somehow
track the objects that have been accounted for in previous shrinker
attempts, and use this in the freeable computation?

That said, how bad it is in practice depends on the rate of reclaim
failure, since the nr_deferred only increments when we fail to scan.
Which is actually not very high - for instance, this is the statistics
from a couple of my benchmark runs (using the old zswap shrinker
protection scheme, i.e the status quo):

/sys/kernel/debug/zswap/reject_reclaim_fail:292
/sys/kernel/debug/zswap/written_back_pages:6947364

IIRC, exclusive loading really lowers the number of reclaim rejection
(you're less likely to observe a page that is already loaded into
memory and in swap cache - its zswap entry is already invalidated).

This could be a problem, and probably should be fixed (for instance,
by adding a no-defer mode for shrinkers similar to zswap), but there
seems to be an additional/orthogonal reason for overshrinking.

>
> If we just need to slow down writeback by making sure we scan entries
> twice, could something similar be achieved just by tuning the seek
> without needing any heuristics to begin with?

Ah but this is a good point. My only defence for not using seek would
be I have no clue how to interpret the seek value :) Based on the
documentation:

int seeks; /* seeks to recreate an obj */

So I'm guessing this sorta kinda represents the number of disk seeks
(or IO?) required to reconstruct the object. Which makes sense in the
special case of 0. But it seems really arbitrary otherwise -
DEFAULT_SEEKS is 2, but zswap does not really take 2 seeks/IOs to
reconstruct the object, no?

It might have been different in the past - I don't know the historical
context here. But now, I guess this is just a shorthand for writeback
pace. The second chance algorithm has a (nominal) justification that
backs it, even though in effect it's the same.

But yeah, if we only think about actual effects, perhaps we don't even
need the reference bit manipulation - just halving the writeback pace
by half is good enough. There might be some differences with
non-exclusive loading, but now that's gone (for the most part). The
other difference I can think of is in the aging-reclaiming ordering:

1. With the reference bit scheme, shrinkers will age the entire LRU
before reclaiming anything (due to LRU rotation).
2. WIth increasing seek value, we will start reclaiming right away
(albeit half of our old pace).

Average reclaim rate should stay the same, but the former gives the
objects on the zswap pool more chance to be loaded into main memory at
the beginning. Or maybe I'm overthinking this and it doesn't really
matter - benchmarking time? :)

>
> I am just trying to understand if the main problem is that zswap does
> not fit well into the shrinker framework as it is, and how we can
> improve this.

Agree. Another potential problem is under highly concurrent settings,
we can have many shrinker instances hitting the same zswap LRU. And
since the protection mechanism is racy/best-effort at best, we can
easily write back well into the protection area.

I was seriously considering physically separating the protected
(active) and unprotected (inactive) LRUs for zswap, but the second
chance algorithm seems a natural way to implement this separation, and
plays quite well with the current shrinker framework - aging and
reclaiming are all serialized via the lru lock.

> Just to be clear, I am in favor of changing those heuristics to
> something more intuitive and simpler, but I really want to understand
> what is going on. The approach taken by this patch is definitely
> simpler, but it doesn't feel more intuitive to me (at least not yet).

Thanks, yeah I agree. I was actually thinking that the second chance
algorithm would be a bit more natural than some protection decaying
formulae I pull out of thin air, but perhaps there are still problems
with it.

> If we take this approach, this needs to be placed in the hole after
> the length, to avoid increasing the size of the zswap_entry.

Very good point - I was actually thinking about this the other day,
but somehow forgot to fix it before submission :) I'll fix this.