Re: [PATCH] mm: zswap: shrink until can accept

From: Johannes Weiner
Date: Tue May 30 2023 - 22:30:29 EST


On Tue, May 30, 2023 at 06:06:38PM -0700, Chris Li wrote:
> On Tue, May 30, 2023 at 02:54:51PM -0400, Johannes Weiner wrote:
> > > Maybe ENOMEM is a bad example. How about if the swap device
> > > just went bad and can't complete new IO writes?
> >
> > This is actually outside the scope of zswap, and handled by the
> > swapcache (end_swap_bio_write).
> >
> > Once the IO is submitted, zswap will ax its copy and leave the rest to
> > the swapcache. When the swapout fails on IO errors, it behaves the
> > same way as if zswap had never been involved to begin with.
> >
> > From a zswap perspective, there are no persistent errors in moving a
> > zswap entry back into the swapcache. Not just currently, but generally.
> Again, you are right that this zswap writeback is async.
> So the writeback error is NOT going to propagate to the
> shrink function.
>
> With the current three pool backends that I looked at {zbud,
> z3fold, zsmalloc}, they all retry 8 times internally.
> Adding more retries does not fundamentally change the existing
> behavior.

Ah, but they're looping over different things.

The internal loop in the zs_reclaim_page() function is about walking
the LRU list until at least one backing page is freed.

Then there is zs_zpool_shrink() which calls zs_reclaim_page() until
the requested number of pages are freed.

Finally, there is shrink_worker(), which calls zs_zpool_shrink(). It
currently does so for a single page when woken up during a store that
hits the zswap pool limit. This is the problematic one, because zswap
is very unlikely to go back to accepting stores after a single page
has been freed.
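
For reference, the pre-patch worker is roughly this (simplified from
mm/zswap.c):

static void shrink_worker(struct work_struct *w)
{
        struct zswap_pool *pool = container_of(w, typeof(*pool),
                                                shrink_work);

        /*
         * Reclaim exactly one page per wakeup, no matter how far the
         * pool is over the limit.
         */
        if (zpool_shrink(pool->zpool, 1, NULL))
                zswap_reject_reclaim_fail++;
        zswap_pool_put(pool);
}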

Domenico's patch isn't adding more retries for error conditions. It
ensures the pool is shrunk back down to where it accepts stores again.

The reason it now looks at errors at all isn't to retry on them
(that's zs_reclaim_page()'s job). It's to avoid an infinite loop in
case there is an unexpectedly high rate of errors across a whole
series of pages (which would suggest a bug of some kind).
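
Put differently, the post-patch loop behaves something like this
(a sketch of the behavior, not the literal patch; MAX_RECLAIM_RETRIES
stands for whatever bound on consecutive failures is used):

static void shrink_worker(struct work_struct *w)
{
        struct zswap_pool *pool = container_of(w, typeof(*pool),
                                                shrink_work);
        int ret, failures = 0;

        do {
                ret = zpool_shrink(pool->zpool, 1, NULL);
                if (ret) {
                        zswap_reject_reclaim_fail++;
                        /*
                         * Give up if errors pile up across a whole
                         * series of pages - likely a bug somewhere.
                         */
                        if (ret != -EAGAIN ||
                            ++failures == MAX_RECLAIM_RETRIES)
                                break;
                }
                cond_resched();
        } while (!zswap_can_accept());
        zswap_pool_put(pool);
}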

> I looked at all the possible error codes generated inside
> the reclaim code; the only notable errors are ENOMEM
> and a concurrent swap invalidation or a racing swapin fault.

Right.

> BTW, zswap reclaim consumes memory. Looping on ENOMEM
> might cause more OOMs. But that can happen in the current
> code as well.

Right.

And this is temporary. Zswap will allocate a page to decompress in,
add it to the swapcache and kick off the IO. Once the page is written
out, it'll be reclaimed again. So while the consumption increases
temporarily, the end result is a net reduction by the amount of
compressed data that was written back from zswap.

This is typical for other types of reclaim as well, e.g. allocating
entries in the swapcache tree, allocating bios and IO requests...
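
To make the lifetime of that temporary page explicit, writing back one
entry goes roughly through these steps (an outline of
zswap_writeback_entry(), not the exact code):

        /*
         * 1. Allocate a page and add it to the swapcache for the
         *    entry's swap offset. This is the temporary allocation,
         *    and also where -ENOMEM or a racing swapin/invalidation
         *    can make the attempt fail for this one entry.
         * 2. Decompress the zswap entry into that page.
         * 3. Mark the page dirty and kick off the swap IO.
         * 4. Once the IO completes, the page is clean and reclaimable,
         *    and the compressed copy is gone from the zswap pool, so
         *    the net effect is a reduction in memory use.
         */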

> > > > Aside from -ENOMEM, writeback_entry will fail on concurrent swap
> > > > invalidation or a racing swapin fault. In both cases we should
> > > > absolutely keep trying other entries until the goal is met.
> > >
> > > How about a narrower fix recognizing those error cases and making
> > > the inner loop continue in those errors?
> >
> > Right, I just I don't really see the value proposition of this
> > complication, and I see some downsides (see below). No single entry
> > error should ever cause us to stop the wider reclaim loop.
>
> That is, until the current LRU list has been gone through once.
> I expect repeating the same list to yield fewer reclaimed pages.

If we see failures due to invalidation races, the entries are freed
anyway and the pool shrinks, getting us closer to the can_accept()
condition. We'll stop looping in the shrinker once enough entries have
been freed - whether we reclaimed them ourselves, or somebody else
invalidated them.
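
For completeness, the can_accept() check is essentially a comparison
of the current pool size against an acceptance threshold below the
limit; something along these lines (an illustrative sketch, not the
literal helper in mm/zswap.c):

static bool can_accept_sketch(void)
{
        unsigned long max_pages = totalram_pages() *
                                  zswap_max_pool_percent / 100;
        unsigned long accept_pages = max_pages *
                                     zswap_accept_thr_percent / 100;
        unsigned long used_pages = DIV_ROUND_UP(zswap_pool_total_size,
                                                PAGE_SIZE);

        /*
         * Accept stores again only once the pool has shrunk below the
         * acceptance threshold, not merely below the hard limit.
         */
        return used_pages < accept_pages;
}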