Re: [PATCH v2] mm: zswap: shrink until can accept

From: Yosry Ahmed
Date: Fri May 26 2023 - 14:16:16 EST


On Fri, May 26, 2023 at 11:10 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Fri, May 26, 2023 at 07:39:55PM +0200, Domenico Cerasuolo wrote:
> > This update addresses an issue with the zswap reclaim mechanism, which
> > hinders the efficient offloading of cold pages to disk, thereby
> > compromising the preservation of the LRU order and consequently
> > diminishing, if not inverting, its performance benefits.
> >
> > The functioning of the zswap shrink worker was found to be inadequate,
> > as shown by a basic benchmark test. For the test, a kernel build was
> > utilized as a reference, with its memory confined to 1G via a cgroup and
> > a 5G swap file provided. The results are presented below; these are
> > averages of three runs without the use of zswap:
> >
> > real 46m26s
> > user 35m4s
> > sys 7m37s
> >
> > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > system), the results changed to:
> >
> > real 56m4s
> > user 35m13s
> > sys 8m43s
> >
> > written_back_pages: 18
> > reject_reclaim_fail: 0
> > pool_limit_hit: 1478
> >
> > Besides the evident regression, one thing to notice from this data is
> > the extremely low number of written_back_pages and pool_limit_hit.
> >
> > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > when zswap is completely full, doesn't account for a particular
> > scenario: once zswap hits its limit, zswap_pool_reached_full is set to
> > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > still above the acceptance threshold. Once we include the rejections due
> > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > 1478 to a significant 21578266.
> >
> > Zswap is stuck in an undesirable state where it rejects pages because
> > it's above the acceptance threshold, yet fails to attempt memory
> > reclamation. This happens because the shrink work is only queued when
> > zswap_frontswap_store detects that it's full and the work itself only
> > reclaims one page per run.
> >
> > This state results in hot pages getting written directly to disk,
> > while cold ones remain in memory, waiting only to be invalidated. The LRU
> > order is completely broken and zswap ends up being just an overhead
> > without providing any benefits.
> >
> > This commit applies 2 changes: a) the shrink worker is set to reclaim
> > pages until the acceptance threshold is met and b) the task is also
> > enqueued when zswap is not full but still above the threshold.
> >
> > Testing this suggested update showed much better numbers:
> >
> > real 36m37s
> > user 35m8s
> > sys 9m32s
> >
> > written_back_pages: 10459423
> > reject_reclaim_fail: 12896
> > pool_limit_hit: 75653
> >
> > V2:
> > - loop against == -EAGAIN rather than != -EINVAL and also break the loop
> > on MAX_RECLAIM_RETRIES (thanks Yosry)
> > - cond_resched() to ensure that the loop doesn't burn the cpu (thanks
> > Vitaly)
> >
> > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@xxxxxxxxx>
> > ---
> > mm/zswap.c | 15 ++++++++++++---
> > 1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 59da2a415fbb..f953dceaab34 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -37,6 +37,7 @@
> > #include <linux/workqueue.h>
> >
> > #include "swap.h"
> > +#include "internal.h"
> >
> > /*********************************
> > * statistics
> > @@ -587,9 +588,17 @@ static void shrink_worker(struct work_struct *w)
> >  {
> >  	struct zswap_pool *pool = container_of(w, typeof(*pool),
> >  						shrink_work);
> > +	int ret, failures = 0;
> >
> > -	if (zpool_shrink(pool->zpool, 1, NULL))
> > -		zswap_reject_reclaim_fail++;
> > +	do {
> > +		ret = zpool_shrink(pool->zpool, 1, NULL);
> > +		if (ret) {
> > +			zswap_reject_reclaim_fail++;
> > +			failures++;
> > +		}
> > +		cond_resched();
> > +	} while (!zswap_can_accept() && ret == -EAGAIN &&
> > +		 failures < MAX_RECLAIM_RETRIES);
>
> It should also loop on !ret, right?
>
> AFAIU Yosry's suggestion was that instead of breaking only on -EINVAL,
> it should break on all failures but -EAGAIN. But it should still keep
> going if the shrink was successful and the pool cannot accept yet.
>
> Basically, something like this?
>
>	do {
>		ret = zpool_shrink(pool->zpool, 1, NULL);
>		if (ret) {
>			zswap_reject_reclaim_fail++;
>			if (ret != -EAGAIN)
>				break;
>			if (++failures == MAX_RECLAIM_RETRIES)
>				break;
>		}
>		cond_resched();
>	} while (!zswap_can_accept());

Yes, that's what I meant. Otherwise if shrink is successful we end up
doing 1 page only, which is exactly what we are trying to avoid here.
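
To make the difference concrete, the two loop conditions can be compared
outside the kernel. The following is only a minimal user-space sketch, not
the patch itself: struct fake_pool, fake_can_accept() and fake_shrink() are
hypothetical stand-ins for the pool, zswap_can_accept() and
zpool_shrink(pool->zpool, 1, NULL), the value of MAX_RECLAIM_RETRIES is
assumed for the sketch, and cond_resched() and the statistics counter are
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16	/* value assumed for this sketch */

/* Hypothetical user-space stand-in for the zswap pool state. */
struct fake_pool {
	int pages;		/* pages currently stored */
	int accept_below;	/* stand-in for the acceptance threshold */
	int eagain_left;	/* upcoming shrink calls that fail with -EAGAIN */
};

/* Stand-in for zswap_can_accept(). */
static bool fake_can_accept(const struct fake_pool *p)
{
	return p->pages < p->accept_below;
}

/* Stand-in for zpool_shrink(pool->zpool, 1, NULL): reclaim one page. */
static int fake_shrink(struct fake_pool *p)
{
	assert(p->pages >= 0);
	if (p->pages == 0)
		return -EINVAL;
	if (p->eagain_left > 0) {
		p->eagain_left--;
		return -EAGAIN;
	}
	p->pages--;
	return 0;
}

/* Loop condition as posted in v2: keeps going only while ret == -EAGAIN. */
static int shrink_v2(struct fake_pool *p)
{
	int ret, failures = 0, reclaimed = 0;

	do {
		ret = fake_shrink(p);
		if (ret)
			failures++;
		else
			reclaimed++;
	} while (!fake_can_accept(p) && ret == -EAGAIN &&
		 failures < MAX_RECLAIM_RETRIES);

	return reclaimed;
}

/* Loop shape from the review: also keeps going after a successful shrink. */
static int shrink_suggested(struct fake_pool *p)
{
	int ret, failures = 0, reclaimed = 0;

	do {
		ret = fake_shrink(p);
		if (ret) {
			if (ret != -EAGAIN)
				break;
			if (++failures == MAX_RECLAIM_RETRIES)
				break;
		} else {
			reclaimed++;
		}
	} while (!fake_can_accept(p));

	return reclaimed;
}
```

With a pool of 100 pages and accept_below set to 91, shrink_v2() returns
after reclaiming a single page because ret is 0 after the first success,
while shrink_suggested() reclaims ten pages and stops once the pool can
accept again.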