Re: [RFC PATCH] mm: bypass swap readahead for zswap

From: Barry Song

Date: Wed Jun 24 2026 - 15:25:16 EST

On Wed, Jun 24, 2026 at 3:57 PM Alexandre Ghiti <alex@xxxxxxxx> wrote:
>
> Commit 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous
> device") made SWP_SYNCHRONOUS_IO devices (e.g. zram) skip swap readahead.
>
> zswap is the same kind of in-memory, synchronous backend as zram, but it is
> not a swap device flagged SWP_SYNCHRONOUS_IO, so it still goes through
> swapin_readahead().
>
> Here are the results from bypassing readahead for zswap too, measured
> with a kernel build (make -j16) in a memcg, with zswap=zstd and the
> shrinker off, on Sapphire Rapids, over 3 iterations.
>
> 768M memcg (sustained swap thrash):
> metric                    mm-new   + bypass   delta
> build time (s)             405.0      341.7   -15.6%
> zswap-in (GB)               79.5       53.0   -33%
> zswap-out (GB)             144.8      115.6   -20%
> swap readahead (pages)     6.79M      0.45M   -93%
> swap_ra hit (%)             72.1       89.9   +18pp
>
> 1G memcg (light pressure, build not memory-bound):
> metric                    mm-new   + bypass   delta
> build time (s)             177.7      176.0   ~same (no regression)
> zswap-in (GB)               10.2        7.5   -26%
> zswap-out (GB)              27.7       25.1   -9%
> swap readahead (pages)     1.07M      0.08M   -93%
> swap_ra hit (%)             68.6       87.2   +19pp
>
> The gain is from no longer prefetching pages that are pointless for an
> in-memory backend: readahead inflates anon residency and thrashes the
> page cache (file pages get evicted and re-read), lengthens each fault by
> synchronously (de)compressing a cluster of neighbours, and adds
> compression traffic when those extra pages are reclaimed.
>
> Bypassing swap readahead for zswap therefore makes sense.
>
> Signed-off-by: Alexandre Ghiti <alex@xxxxxxxx>
> ---
>
> - This bypass originally comes from Usama's series that implements
> large folio zswapin: while working on improving this series, I noticed
> the gains I got only came from the bypass of readahead.
>
> include/linux/zswap.h | 6 ++++++
> mm/memory.c | 5 +++--
> mm/zswap.c | 11 +++++++++++
> 3 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 30c193a1207e..b6f0e6198b6f 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -35,6 +35,7 @@ void zswap_lruvec_state_init(struct lruvec *lruvec);
> void zswap_folio_swapin(struct folio *folio);
> bool zswap_is_enabled(void);
> bool zswap_never_enabled(void);
> +bool zswap_present_test(swp_entry_t swp);
> #else
>
> struct zswap_lruvec_state {};
> @@ -69,6 +70,11 @@ static inline bool zswap_never_enabled(void)
> return true;
> }
>
> +static inline bool zswap_present_test(swp_entry_t swp)
> +{
> + return false;
> +}
> +
> #endif
>
> #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..5aa1ea9eb48a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4827,8 +4827,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (folio)
> swap_update_readahead(folio, vma, vmf->address);
> if (!folio) {
> - /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices */
> - if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> + /* Swapin bypasses readahead for SWP_SYNCHRONOUS_IO devices and zswap */
> + if (data_race(si->flags & SWP_SYNCHRONOUS_IO) ||
> + zswap_present_test(entry))
> folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
> thp_swapin_suitable_orders(vmf) | BIT(0),
> vmf, NULL, 0);

Basically, I have been seeing the same issue recently. If the
readahead swap entries are also in zswap, we end up doing the
decompression during one page fault, but then need another page fault
to fetch the page from the swap cache and install the mapping. In that
case, readahead may not be beneficial.

On the other hand, if the readahead swap entries are not in zswap, the
situation is different.

For example, suppose we fault on the swap entry for address 1 MB and
readahead brings in the entry for 1 MB + 4 KB. If both entries are in
zswap, readahead does not seem like a good trade-off. However, if the
1 MB + 4 KB entry is not in zswap and would otherwise require storage
I/O, then readahead can be beneficial.
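
To make the all-in-zswap case concrete, here is a toy userspace model (not
kernel code). It assumes a process touches n consecutive pages that are all
held in zswap, and that each major fault decompresses a whole readahead
window. The struct, its fields and the function name are hypothetical and
exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical counters for a toy model; these are not kernel statistics. */
struct swapin_cost {
	unsigned long faults;               /* all page faults taken */
	unsigned long major_faults;         /* faults that had to decompress */
	unsigned long decompressions;       /* pages decompressed in total */
	unsigned long max_decomp_per_fault; /* worst-case work in one fault */
};

/*
 * Touch n consecutive pages, all in zswap, with a readahead window of
 * `window` pages (window == 1 means no readahead). Readahead pages land
 * in the swap cache and still need their own fault to be mapped.
 */
static struct swapin_cost model_sequential_swapin(unsigned long n,
						  unsigned long window)
{
	struct swapin_cost c = {0};
	unsigned long cached = 0; /* readahead pages waiting in swap cache */

	for (unsigned long i = 0; i < n; i++) {
		c.faults++;
		if (cached > 0) {
			/* Minor fault: page is in swap cache, only map it. */
			cached--;
			continue;
		}
		/* Major fault: decompress this page plus its neighbours. */
		unsigned long batch = window < n - i ? window : n - i;

		c.major_faults++;
		c.decompressions += batch;
		if (batch > c.max_decomp_per_fault)
			c.max_decomp_per_fault = batch;
		cached = batch - 1;
	}
	return c;
}
```

In this model, model_sequential_swapin(8, 1) and model_sequential_swapin(8, 2)
both take 8 faults and perform 8 decompressions. The window of 2 only moves
the decompression work into half as many, longer faults. The model ignores
storage I/O entirely, which is exactly the case where readahead can still pay
off.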

So I implemented a rather ugly fault_around-like mechanism in
do_swap_page(). At least with page-cluster == 1, I am seeing a
performance improvement, as the readahead folios can be mapped
directly and do not require a second page fault.

It is admittedly quite ugly and is only meant as a proof of concept :-)
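
For readers who want to check the window arithmetic outside the kernel, the
following userspace sketch mirrors how the PoC clamps the map-around range to
the VMA and to the PMD that contains the faulting address. It assumes 4 KiB
pages, a 2 MiB PMD and a page-aligned fault address. The struct and helper
names are hypothetical, and the underflow guard is an addition of this sketch
(the PoC subtracts directly).

```c
#include <assert.h>

/* Assumed geometry for this sketch: 4 KiB pages, 2 MiB PMD. */
#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  (1UL << SK_PAGE_SHIFT)
#define SK_PMD_SIZE   (1UL << 21)
#define SK_PMD_MASK   (~(SK_PMD_SIZE - 1))

/* Hypothetical result type: [start, end) of the map-around window. */
struct map_window {
	unsigned long start;
	unsigned long end;
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Clamp the window around a page-aligned fault address to the VMA
 * [vm_start, vm_end) and to the PMD covering addr, so that PTE pointer
 * arithmetic stays within one page table.
 */
static struct map_window swap_map_window(unsigned long vm_start,
					 unsigned long vm_end,
					 unsigned long addr, int page_cluster)
{
	unsigned long nr_around = 1UL << page_cluster;
	unsigned long span = (nr_around - 1) * SK_PAGE_SIZE;
	unsigned long pmd_start = addr & SK_PMD_MASK;
	/* Sketch-only guard against unsigned underflow near address 0. */
	unsigned long lo = addr > span ? addr - span : 0;
	struct map_window w;

	w.start = max_ul(max_ul(vm_start, lo), pmd_start);
	w.end = min_ul(min_ul(vm_end, addr + nr_around * SK_PAGE_SIZE),
		       pmd_start + SK_PMD_SIZE);
	return w;
}

/* Number of pages covered by the window. */
static unsigned long window_pages(struct map_window w)
{
	return (w.end - w.start) >> SK_PAGE_SHIFT;
}
```

With a VMA of [1 MB, 2 MB), a fault at 1 MB and page-cluster == 1, the window
is [1 MB, 1 MB + 8 KB): the faulting page plus the 1 MB + 4 KB neighbour from
the example above.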

Subject: [PATCH PoC] mm: enable do_swap_page fault_around

Signed-off-by: Barry Song (Xiaomi) <baohua@xxxxxxxxxx>
---
mm/memory.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 95 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index c00a31a6d1d0..1db79f45a575 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4736,6 +4736,100 @@ static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
} while (--nr_pages);
}

+static void do_swap_map_around(struct vm_fault *vmf, struct swap_info_struct *si)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ int nr_around = 1 << page_cluster;
+ unsigned long start = max3(vma->vm_start, vmf->address - (nr_around - 1) * PAGE_SIZE,
+ vmf->address & PMD_MASK);
+ unsigned long end = min3(vma->vm_end, vmf->address + nr_around * PAGE_SIZE,
+ (vmf->address & PMD_MASK) + PMD_SIZE);
+ unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
+ unsigned long delta_pages = (vmf->address - start) >> PAGE_SHIFT;
+ pte_t *ptep = vmf->pte - delta_pages;
+
+ for (int i = 0; i < nr_pages; i++, ptep++) {
+ unsigned long address = start + (i << PAGE_SHIFT);
+ rmap_t rmap_flags = RMAP_NONE;
+ pte_t orig_pte, pte;
+ struct folio *folio;
+ struct page *page;
+ softleaf_t entry;
+ bool exclusive;
+
+ if (ptep == vmf->pte)
+ continue;
+ orig_pte = ptep_get(ptep);
+ exclusive = pte_swp_exclusive(orig_pte);
+ if (!exclusive)
+ continue;
+ entry = softleaf_from_pte(orig_pte);
+ if (!softleaf_is_swap(entry))
+ continue;
+ folio = swap_cache_get_folio(entry);
+ if (!folio)
+ continue;
+ if (unlikely(!folio_matches_swap_entry(folio, entry)))
+ goto skip;
+ if (folio_test_locked(folio))
+ goto skip;
+ if (!folio_test_uptodate(folio))
+ goto skip;
+ if (!folio_trylock(folio))
+ goto skip;
+ if (folio_test_ksm(folio) || folio_test_large(folio) ||
+ !folio_test_uptodate(folio))
+ goto unlock;
+ if (exclusive && folio_test_writeback(folio) &&
+ data_race(si->flags & SWP_STABLE_WRITES))
+ exclusive = false;
+
+ arch_swap_restore(folio_swap(entry, folio), folio);
+
+ page = folio_page(folio, 0);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -1);
+ pte = mk_pte(page, vma->vm_page_prot);
+ if (pte_swp_soft_dirty(orig_pte))
+ pte = pte_mksoft_dirty(pte);
+ if (pte_swp_uffd_wp(orig_pte))
+ pte = pte_mkuffd_wp(pte);
+
+ if (exclusive) {
+ if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
+ !pte_needs_soft_dirty_wp(vma, pte)) {
+ pte = pte_mkwrite(pte, vma);
+ }
+ rmap_flags |= RMAP_EXCLUSIVE;
+ }
+ flush_icache_pages(vma, page, 1);
+
+ if (!folio_test_anon(folio)) {
+ folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
+ folio_put_swap(folio, NULL);
+ } else {
+ folio_add_anon_rmap_ptes(folio, page, 1, vma, address,
+ rmap_flags);
+ folio_put_swap(folio, page);
+ }
+
+ set_ptes(vma->vm_mm, address, ptep, pte, 1);
+ arch_do_swap_page_nr(vma->vm_mm, vma, address,
+ pte, pte, 1);
+
+ if (should_try_to_free_swap(si, folio, vma, 1, vmf->flags))
+ folio_free_swap(folio);
+ folio_unlock(folio);
+ swap_update_readahead(folio, vma, address);
+ update_mmu_cache_range(vmf, vma, address, ptep, 1);
+ continue;
+unlock:
+ folio_unlock(folio);
+skip:
+ folio_put(folio);
+ }
+}
+
/*
* We enter with either the VMA lock or the mmap_lock held (see
* FAULT_FLAG_VMA_LOCK), and pte mapped but not yet locked.
@@ -5121,6 +5215,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

/* No need to invalidate - it was non-present before */
update_mmu_cache_range(vmf, vma, address, ptep, nr_pages);
+ do_swap_map_around(vmf, si);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
--
2.34.1