Re: [PATCH v2 6/6] mm: zswap: interrupt shrinker writeback while pagein/out IO
From: Nhat Pham
Date: Mon Jul 08 2024 - 15:17:47 EST
On Fri, Jul 5, 2024 at 7:25 PM Takero Funaki <flintglass@xxxxxxxxx> wrote:
>
> To prevent the zswap global shrinker from writing back pages
> simultaneously with IO performed for memory reclaim and faults, delay
> the writeback when zswap_store() rejects pages or zswap_load() cannot
> find the entry in the pool.
>
> When the zswap shrinker is running and zswap rejects an incoming page,
> simultaneous zswap writeback and the rejected page's swap-out cause IO
> contention on the swap device. In this case, the writeback of the
> rejected page must take higher priority, as it is necessary for actual
> memory reclaim progress.
> The zswap global shrinker can run in the background and should not
> interfere with memory reclaim.
Do you see this problem actually manifesting in real life? It does not
sound infeasible to me, but I wonder how likely it is in practice.
Do you have any userspace-visible metrics, or any tracing logs, etc.,
that prove it actually happens?
This might affect the dynamic shrinker as well, FWIW.
>
> The same logic applies to zswap_load(). When zswap cannot find the
> requested page in the pool and read IO is performed, the shrinker
> should be interrupted.
>
> To avoid IO contention, save the timestamp in jiffies when zswap cannot
> buffer the pagein/out IO, and interrupt the global shrinker. The
> shrinker resumes writeback 500 msec after the saved timestamp.
>
> Signed-off-by: Takero Funaki <flintglass@xxxxxxxxx>
> ---
> mm/zswap.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index def0f948a4ab..59ba4663c74f 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -35,6 +35,8 @@
> #include <linux/pagemap.h>
> #include <linux/workqueue.h>
> #include <linux/list_lru.h>
> +#include <linux/delay.h>
> +#include <linux/jiffies.h>
>
> #include "swap.h"
> #include "internal.h"
> @@ -176,6 +178,14 @@ static bool zswap_next_shrink_changed;
> static struct work_struct zswap_shrink_work;
> static struct shrinker *zswap_shrinker;
>
> +/*
> + * To avoid IO contention between pagein/out and global shrinker writeback,
> + * track the last jiffies of pagein/out and delay the writeback.
> + * Default to 500msec in alignment with mq-deadline read timeout.
If there is a future version, could you include the reason why you
selected 500msec in the patch's changelog as well?
> + */
> +#define ZSWAP_GLOBAL_SHRINKER_DELAY_MS 500
> +static unsigned long zswap_shrinker_delay_start;
> +
> /*
> * struct zswap_entry
> *
> @@ -244,6 +254,14 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name, \
> zpool_get_type((p)->zpools[0]))
>
> +static inline void zswap_shrinker_delay_update(void)
> +{
> + unsigned long now = jiffies;
> +
> + if (now != zswap_shrinker_delay_start)
> + zswap_shrinker_delay_start = now;
> +}
Hmmm, is there a reason why we do not just do:
zswap_shrinker_delay_start = jiffies;
or, more verbosely:
unsigned long now = jiffies;
zswap_shrinker_delay_start = now;
IOW, why the branching here? It does not seem necessary to me, but
perhaps there is a correctness/compiler reason I'm not seeing?
In fact, if we go with the first version, we could just manually inline
it at the callsites.
Additionally/alternatively, I wonder if it is more convenient to do it
at the mm/page_io.c zswap callsites, i.e. whenever zswap_store() or
zswap_load() returns false, delay the shrinker before proceeding with
the IO steps - a rough sketch of what I mean follows.
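Something like the below, purely illustrative - the swap_writepage()
shape is simplified from mm/page_io.c, and it assumes
zswap_shrinker_delay_update() from this patch is made visible to
page_io.c (with the symmetric change in the swap_read_folio() path when
zswap_load() returns false):

int swap_writepage(struct page *page, struct writeback_control *wbc)
{
	struct folio *folio = page_folio(page);

	...
	if (zswap_store(folio)) {
		folio_unlock(folio);
		return 0;
	}
	/*
	 * zswap rejected the folio, so real swap IO follows:
	 * hold off the global shrinker before issuing it.
	 */
	zswap_shrinker_delay_update();
	__swap_writepage(folio, wbc);
	return 0;
}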
> +
> /*********************************
> * pool functions
> **********************************/
> @@ -1378,6 +1396,8 @@ static void shrink_worker(struct work_struct *w)
> struct mem_cgroup *memcg;
> int ret, failures = 0, progress;
> unsigned long thr;
> + unsigned long now, sleepuntil;
> + const unsigned long delay = msecs_to_jiffies(ZSWAP_GLOBAL_SHRINKER_DELAY_MS);
>
> /* Reclaim down to the accept threshold */
> thr = zswap_accept_thr_pages();
> @@ -1405,6 +1425,21 @@ static void shrink_worker(struct work_struct *w)
> * until the next run of shrink_worker().
> */
> do {
> + /*
> + * delay shrinking to allow the last rejected page to complete
> + * its writeback
> + */
> + sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);
I assume we do not care about racy access here, right? The same goes
for the updates - I don't see any locks protecting these operations
(but I could be missing something).
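If the answer is that the race is benign, it might be worth documenting
that with the matching annotation on the store side as well - a sketch,
reusing the names from this patch:

/* writer side, in the zswap_store()/zswap_load() reject paths */
WRITE_ONCE(zswap_shrinker_delay_start, jiffies);

/* reader side, here in shrink_worker(), as the patch already does */
sleepuntil = delay + READ_ONCE(zswap_shrinker_delay_start);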
> + now = jiffies;
> + /*
> + * If zswap has not rejected pages for a while, sleepuntil - now may
> + * underflow. We assume the timestamp is valid only if
> + * now < sleepuntil < now + delay + 1
> + */
> + if (time_before(now, sleepuntil) &&
> + time_before(sleepuntil, now + delay + 1))
> + fsleep(jiffies_to_usecs(sleepuntil - now));
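nit: unless I'm misreading it, this window check asks whether
sleepuntil lies in (now, now + delay], so it could also be written with
time_in_range() from <linux/jiffies.h> - a sketch, which should be
equivalent as far as I can tell:

if (time_in_range(sleepuntil, now + 1, now + delay))
	fsleep(jiffies_to_usecs(sleepuntil - now));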
> +
> spin_lock(&zswap_shrink_lock);
>
> /*
> @@ -1526,8 +1561,10 @@ bool zswap_store(struct folio *folio)
> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
> /* Large folios aren't supported */
> - if (folio_test_large(folio))
> + if (folio_test_large(folio)) {
> + zswap_shrinker_delay_update();
> return false;
> + }
>
> if (!zswap_enabled)
> goto check_old;
> @@ -1648,6 +1685,8 @@ bool zswap_store(struct folio *folio)
> zswap_entry_cache_free(entry);
> reject:
> obj_cgroup_put(objcg);
> + zswap_shrinker_delay_update();
> +
> if (need_global_shrink)
> queue_work(shrink_wq, &zswap_shrink_work);
> check_old:
> @@ -1691,8 +1730,10 @@ bool zswap_load(struct folio *folio)
> else
> entry = xa_load(tree, offset);
>
> - if (!entry)
> + if (!entry) {
> + zswap_shrinker_delay_update();
> return false;
> + }
>
> if (entry->length)
> zswap_decompress(entry, page);
> @@ -1835,6 +1876,8 @@ static int zswap_setup(void)
> if (ret)
> goto hp_fail;
>
> + zswap_shrinker_delay_update();
> +
> shrink_wq = alloc_workqueue("zswap-shrink",
> WQ_UNBOUND, 1);
> if (!shrink_wq)
> --
> 2.43.0
>