[PATCH] mm/swap: free swap cache before mapping to fix folio reuse

From: Kairui Song

Date: Wed Jun 24 2026 - 23:44:26 EST

This is effectively a revert of commit 4b34f1d82c65 ("mm, swap: free the
swap cache after folio is mapped").

That commit tried to reduce wasted faults and folio allocations caused
by parallel swapins of the same entry: keeping the folio in the swap
cache until the PTE is installed makes a racing fault more likely to find
it and wait on the folio lock, rather than allocate and charge a new folio
only to discover the race and throw it away, which leads to thrashing.

That benefit is marginal now. Since commit 02d733a7ec1d ("mm, swap: unify
large folio allocation"), both folio allocation and swapin are gated on
the swap entry count in the swap cache layer. Once the winning fault
drops the entry count, a racing fault that misses the cache will fail
the swap entry count check too and back off at the page table lock before
allocating anything. Note that a tiny race window remains with or without
either patch, but it is already minimal and can be closed in other ways
later if needed.

Meanwhile, the earlier commit has a real problem, as David pointed out [1].
The swap cache reference is still held during the write fault folio
reference check, so a write fault on a sole-owned but non-exclusive folio
always falls back to do_wp_page() instead of reusing the folio in place.
The write-fault path of should_try_to_free_swap() is broken too, as the
FAULT_FLAG_WRITE flag is missing.

There is no correctness problem though; only the reuse and cleanup
optimizations were lost.
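
The ordering issue can be illustrated with a small user-space model. This
is only a sketch, not kernel code: every toy_* name is hypothetical, and
it assumes that the swap cache holds one reference per page (as the
"1 + folio_nr_pages(folio)" check suggests) and that in-place reuse
requires the faulting task to hold the only remaining reference.

```c
#include <stdbool.h>

/* Toy user-space model; none of these names are kernel APIs. */
struct toy_folio {
	int nr_pages;		/* pages in the folio */
	int refcount;		/* total references held */
	bool in_swapcache;	/* cache holds nr_pages refs while true */
};

/* Folio as the fault sees it: one fault ref plus the swap cache refs. */
static struct toy_folio toy_swapin(int nr_pages)
{
	struct toy_folio f = { nr_pages, 1 + nr_pages, true };
	return f;
}

/* Modeled on the patched check: write fault, no refs besides ours and the cache's. */
static bool toy_should_try_to_free_swap(const struct toy_folio *f,
					bool write_fault)
{
	return f->in_swapcache && write_fault &&
	       f->refcount == 1 + f->nr_pages;
}

/* Drop the swap cache references (assumed: one per page). */
static void toy_free_swap(struct toy_folio *f)
{
	f->refcount -= f->nr_pages;
	f->in_swapcache = false;
}

/* Assumed sole-owner test for in-place reuse: only our reference is left. */
static bool toy_can_reuse(const struct toy_folio *f)
{
	return f->refcount == 1;
}

/* Order restored by this patch: free the cache, then run the reuse check. */
static bool toy_free_before_check(int nr_pages)
{
	struct toy_folio f = toy_swapin(nr_pages);

	if (toy_should_try_to_free_swap(&f, true))
		toy_free_swap(&f);
	return toy_can_reuse(&f);
}

/* Order being reverted: the reuse check runs while the cache still holds refs. */
static bool toy_check_before_free(int nr_pages)
{
	struct toy_folio f = toy_swapin(nr_pages);
	bool reuse = toy_can_reuse(&f);

	if (toy_should_try_to_free_swap(&f, true))
		toy_free_swap(&f);
	return reuse;
}
```

In this model toy_free_before_check() reports that the folio can be
reused, while toy_check_before_free() never does, because the cache
references are still counted when the sole-owner test runs.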

So revert it. This makes the sole-owner check effective again and
recovers the reuse fast path. Also slightly adjust a sanity check
to be more rigorous.

Fixes: 4b34f1d82c65 ("mm, swap: free the swap cache after folio is mapped")
Reported-by: David Hildenbrand (Arm) <david@xxxxxxxxxx>
Closes: https://lore.kernel.org/linux-mm/e56c4d73-ed4a-48bb-8d0a-97b1200d4a35@xxxxxxxxxx/ [1]
Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
---
mm/memory.c | 33 +++++++++++++++++++--------------
1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ff338c2abe92..f31d6b8e8c0b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4512,7 +4512,6 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
static inline bool should_try_to_free_swap(struct swap_info_struct *si,
struct folio *folio,
struct vm_area_struct *vma,
- unsigned int extra_refs,
unsigned int fault_flags)
{
if (!folio_test_swapcache(folio))
@@ -4535,7 +4534,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
* reference only in case it's likely that we'll be the exclusive user.
*/
return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
+ folio_ref_count(folio) == (1 + folio_nr_pages(folio));
}

static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
@@ -5033,6 +5032,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
arch_swap_restore(folio_swap(entry, folio), folio);

+ /*
+ * Remove the swap entry and conditionally try to free up the swapcache.
+ * We're already holding a reference on the page but haven't mapped it
+ * yet.
+ *
+ * The swap count has to be freed to 0 first so folio_free_swap
+ * can free exclusive clean cache. Concurrent fault is very unlikely
+ * to trigger redundant folio alloc, nor will it cause redundant IO,
+ * as the swap entry count gates it.
+ */
+ VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
+ if (folio != swapcache)
+ folio_put_swap(swapcache, NULL);
+ else
+ folio_put_swap(folio, nr_pages < folio_nr_pages(folio) ? page : NULL);
+ if (should_try_to_free_swap(si, folio, vma, vmf->flags))
+ folio_free_swap(folio);
+
add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
pte = mk_pte(page, vma->vm_page_prot);
@@ -5067,7 +5084,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(folio != swapcache)) {
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
- folio_put_swap(swapcache, NULL);
} else if (!folio_test_anon(folio)) {
/*
* We currently only expect !anon folios that are fully
@@ -5076,12 +5092,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
- folio_put_swap(folio, NULL);
} else {
- VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
rmap_flags);
- folio_put_swap(folio, nr_pages == 1 ? page : NULL);
}

VM_BUG_ON(!folio_test_anon(folio) ||
@@ -5090,14 +5103,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
arch_do_swap_page_nr(vma->vm_mm, vma, address,
pte, pte, nr_pages);

- /*
- * Remove the swap entry and conditionally try to free up the swapcache.
- * Do it after mapping, so raced page faults will likely see the folio
- * in swap cache and wait on the folio lock.
- */
- if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
- folio_free_swap(folio);
-
folio_unlock(folio);
if (unlikely(folio != swapcache)) {
/*
--
2.54.0