Re: [RFC PATCH] mm: bypass swap readahead for zswap

From: Yosry Ahmed

Date: Wed Jun 24 2026 - 14:01:42 EST


On Wed, Jun 24, 2026 at 12:57 AM Alexandre Ghiti <alex@xxxxxxxx> wrote:
>
> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
>
> zswap is the same kind of in-memory, synchronous backend as zram, but it
> is not a swap device flagged SWP_SYNCHRONOUS_IO, so it still goes through
> swapin_readahead().
>
> Here are the results of bypassing readahead for zswap as well, measured
> with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker off, on
> Sapphire Rapids, over 3 iterations.
>
> 768M memcg (sustained swap thrash):
>   metric                   mm-new   + bypass   delta
>   build time (s)            405.0      341.7   -15.6%
>   zswap-in (GB)              79.5       53.0   -33%
>   zswap-out (GB)            144.8      115.6   -20%
>   swap readahead (pages)    6.79M      0.45M   -93%
>   swap_ra hit (%)            72.1       89.9   +18pp
>
> 1G memcg (light pressure, build not memory-bound):
>   metric                   mm-new   + bypass   delta
>   build time (s)            177.7      176.0   ~same (no regression)
>   zswap-in (GB)              10.2        7.5   -26%
>   zswap-out (GB)             27.7       25.1   -9%
>   swap readahead (pages)    1.07M      0.08M   -93%
>   swap_ra hit (%)            68.6       87.2   +19pp
>
> The gain is from no longer prefetching pages that are pointless for an
> in-memory backend: readahead inflates anon residency and thrashes the
> page cache (file pages get evicted and re-read), lengthens each fault by
> synchronously (de)compressing a cluster of neighbours, and adds
> compression traffic when those extra pages are reclaimed.
>
> Bypassing swap readahead for zswap therefore makes sense.
>
> Signed-off-by: Alexandre Ghiti <alex@xxxxxxxx>
> ---
>
> - This bypass originally comes from Usama's series that implements
> large folio zswapin: while working on improving this series, I noticed
> the gains I got only came from the bypass of readahead.
>
> include/linux/zswap.h | 6 ++++++
> mm/memory.c | 5 +++--
> mm/zswap.c | 11 +++++++++++
> 3 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 30c193a1207e..b6f0e6198b6f 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
> void zswap_folio_swapin(struct folio *folio);
> bool zswap_is_enabled(void);
> bool zswap_never_enabled(void);
> +bool zswap_present_test(swp_entry_t swp);
> #else
>
> struct zswap_lruvec_state {};
> @@ -69,6 +70,11 @@ static inline bool zswap_never_enabled(void)
> return true;
> }
>
> +static inline bool zswap_present_test(swp_entry_t swp)
> +{
> + return false;
> +}
> +
> #endif
>
> #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..5aa1ea9eb48a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (folio)
> swap_update_readahead(folio, vma, vmf->address);
> if (!folio) {
> - /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
> - if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> + /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
> + if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
> + zswap_present_test(entry))

This assumes that if the swap entry is in zswap, then the remaining
entries (covered by the readahead window) will also be in zswap,
right? While not very likely, it's possible that some of the remaining
entries are not in zswap but on disk (e.g. after writeback), in which
case we would skip readahead for entries that could benefit from it.
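
To illustrate the case I mean, here is a toy userspace model, not kernel
code. The in_zswap[] array and window_all_in_zswap() are hypothetical names
made up for this sketch, and it assumes an aligned, power-of-two readahead
window, which only loosely resembles cluster readahead:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: in_zswap[i] is true if swap offset i currently has a zswap
 * entry, false if it lives on the backing disk (e.g. after writeback).
 *
 * Returns true only if every slot in the aligned window that contains
 * fault_off is in zswap. win must be a power of two.
 */
static bool window_all_in_zswap(const bool *in_zswap, size_t nr_slots,
				size_t fault_off, size_t win)
{
	size_t start = fault_off & ~(win - 1);	/* align down to window */
	size_t end = start + win;
	size_t i;

	if (end > nr_slots)
		end = nr_slots;

	for (i = start; i < end; i++) {
		if (!in_zswap[i])
			return false;	/* a neighbour is on disk */
	}
	return true;
}
```

With slots 0-7 where only slot 5 has been written back, a fault at offset 4
passes the single-entry test in the patch, yet the 4-slot window [4, 8)
still contains an on-disk entry and the function above returns false.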

> folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
> thp_swapin_suitable_orders(vmf) | BIT(0),
> vmf, NULL, 0);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 761cd699e0a3..5b85b4d17647 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -234,6 +234,17 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> >> ZSWAP_ADDRESS_SPACE_SHIFT];
> }
>
> +/**
> + * zswap_present_test - check if a swap entry is currently backed by zswap
> + * @swp: the swap entry to test
> + *
> + * Return: true if @swp has a zswap entry, false otherwise.
> + */
> +bool zswap_present_test(swp_entry_t swp)

zswap_is_present()?

> +{
> + return xa_load(swap_zswap_tree(swp), swp_offset(swp));
> +}
> +
> #define zswap_pool_debug(msg, p) \
> pr_debug("%s pool %s\n", msg, (p)->tfm_name)
>
> --
> 2.54.0
>
>