Re: [RFC PATCH] mm: bypass swap readahead for zswap

From: Alexandre Ghiti

Date: Thu Jun 25 2026 - 03:46:48 EST


Hi Yosry,

On 6/24/26 20:01, Yosry Ahmed wrote:
On Wed, Jun 24, 2026 at 12:57 AM Alexandre Ghiti <alex@xxxxxxxx> wrote:
Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.

zswap is the same kind of in-memory, synchronous backend as zram, but it
is not a swap device flagged SWP_SYNCHRONOUS_IO, so it still goes through
swapin_readahead().

Here are the results from bypassing readahead for zswap too, measured
with a kernel build (make -j16) in a memcg, zswap=zstd, shrinker off, on
Sapphire Rapids, over 3 iterations.

768M memcg (sustained swap thrash):
metric                    mm-new    + bypass   delta
build time (s)            405.0     341.7      -15.6%
zswap-in (GB)             79.5      53.0       -33%
zswap-out (GB)            144.8     115.6      -20%
swap readahead (pages)    6.79M     0.45M      -93%
swap_ra hit (%)           72.1      89.9       +18pp

1G memcg (light pressure, build not memory-bound):
metric                    mm-new    + bypass   delta
build time (s)            177.7     176.0      ~same (no regression)
zswap-in (GB)             10.2      7.5        -26%
zswap-out (GB)            27.7      25.1       -9%
swap readahead (pages)    1.07M     0.08M      -93%
swap_ra hit (%)           68.6      87.2       +19pp

The gain is from no longer prefetching pages that are pointless for an
in-memory backend: readahead inflates anon residency and thrashes the
page cache (file pages get evicted and re-read), lengthens each fault by
synchronously (de)compressing a cluster of neighbours, and adds
compression traffic when those extra pages are reclaimed.

Bypassing swap readahead for zswap therefore makes sense.

Signed-off-by: Alexandre Ghiti <alex@xxxxxxxx>
---

- This bypass originally comes from Usama's series that implements
large folio zswapin: while working on improving this series, I noticed
the gains I got only came from the bypass of readahead.

include/linux/zswap.h | 6 ++++++
mm/memory.c | 5 +++--
mm/zswap.c | 11 +++++++++++
3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 30c193a1207e..b6f0e6198b6f 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
void zswap_folio_swapin(struct folio *folio);
bool zswap_is_enabled(void);
bool zswap_never_enabled(void);
+bool zswap_present_test(swp_entry_t swp);
#else

struct zswap_lruvec_state {};
@@ -69,6 +70,11 @@ static inline bool zswap_never_enabled(void)
return true;
}

+static inline bool zswap_present_test(swp_entry_t swp)
+{
+ return false;
+}
+
#endif

#endif /* _LINUX_ZSWAP_H */
diff --git a/mm/memory.c b/mm/memory.c
index ff338c2abe92..5aa1ea9eb48a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (folio)
swap_update_readahead(folio, vma, vmf->address);
if (!folio) {
- /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+ /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
+ zswap_present_test(entry))
This assumes that if the swap entry is in zswap, then the remaining
entries (covered by the readahead window) will also be in zswap,
right? While not very likely, it's possible that the remaining entries
are not in zswap but on disk, right?


Yes, I assumed locality here.

Indeed, it would be interesting to keep the readahead but only actually read ahead the entries that are on the swap disk. I don't know how this will affect the readahead window (it is computed from the number of PG_readahead hits iirc) but I can give it a try.
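
To make the idea concrete, here is a rough user-space sketch, not kernel code and not part of the patch. It models a readahead window in which the faulting entry is always read, while neighbouring entries are prefetched only when they are backed by the swap disk. The enum slot_backing type and the ra_pick_disk_only() helper are hypothetical names made up for this illustration, and the sketch does not try to model how the window size itself is computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of where the data of a swap slot currently lives. */
enum slot_backing { SLOT_FREE, SLOT_ZSWAP, SLOT_DISK };

/*
 * Pick the offsets in [start, start + win) that would be read.
 * The faulting offset is always picked; any other offset is picked only
 * if its slot is disk-backed, so zswap-backed neighbours are skipped.
 * Offsets at or beyond nr_slots are ignored.
 *
 * @out must have room for @win entries. Returns the number of offsets
 * written to @out.
 */
size_t ra_pick_disk_only(const enum slot_backing *slots, size_t nr_slots,
                         size_t fault_off, size_t start, size_t win,
                         size_t *out)
{
    size_t n = 0;

    assert(slots != NULL && out != NULL);

    for (size_t off = start; off < start + win && off < nr_slots; off++) {
        bool is_fault = (off == fault_off);

        if (is_fault || slots[off] == SLOT_DISK)
            out[n++] = off;
    }
    return n;
}
```

With a window where the fault hits a zswap entry surrounded by a mix of zswap and disk entries, only the faulting entry and the disk-backed neighbours end up being read. Whether skipping entries this way interacts well with the PG_readahead hit accounting is exactly the open question above.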



folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
thp_swapin_suitable_orders(vmf) | BIT(0),
vmf, NULL, 0);
diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e0a3..5b85b4d17647 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -234,6 +234,17 @@ static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
>> ZSWAP_ADDRESS_SPACE_SHIFT];
}

+/**
+ * zswap_present_test - check if a swap entry is currently backed by zswap
+ * @swp: the swap entry to test
+ *
+ * Return: true if @swp has a zswap entry, false otherwise.
+ */
+bool zswap_present_test(swp_entry_t swp)
zswap_is_present()?


Agreed, the naming is not perfect. I'll change it to either your proposal or zswap_is_entry_present() (or something else), but I'll definitely change it.

Thanks,

Alex



+{
+ return xa_load(swap_zswap_tree(swp), swp_offset(swp));
+}
+
#define zswap_pool_debug(msg, p) \
pr_debug("%s pool %s\n", msg, (p)->tfm_name)

--
2.54.0