Re: [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios

From: David Hildenbrand (Arm)

Date: Tue Jun 30 2026 - 01:48:46 EST

On 6/30/26 01:59, Barry Song wrote:
> On Sat, Jun 27, 2026 at 12:36 AM David Hildenbrand (Arm)
> <david@xxxxxxxxxx> wrote:
>>
>>>
>>> That's awkward, my bad, terribly sorry about this :(
>>>
>>> I sincerely apologize for the oversight and the trouble this has caused.
>>>
>>
>> All good, I was just surprised to see a previous optimization partially
>> reverted without a clear reasoning :)
>>
>> Because it should have removed the handling in should_try_to_free_swap() as well.
>>
>> It's good that we are discussing it now!
>>
>>> I haven't seen any performance regression in any workload recently
>>> though, or any correctness issue, perhaps the round trip of
>>> do_wp_page wasn't that bad. It should still catch the reuse folios,
>>> just more costly than doing things in-place.
>>
>> Right, do_wp_page() handles it, after the page was mapped. It adds some overhead,
>> but fortunately no TLB flush if we're just upgrading write permissions.
>>
>> The optimization dates back to pre PageAnonExclusive handling.
>>
>>>
>>> I think we should restore the original check first. We might also want to
>>> avoid dropping the swap cache if the folio will not be reused, which
>>> was discussed here:
>>>
>>> https://lore.kernel.org/linux-mm/CAMgjq7BDfvNXdWH0cqarsujjUn3i3tDDhDkmSg01TR4h-tDorQ@xxxxxxxxxxxxxx/
>>>
>>> Maybe extracting some common part into a helper can help make this
>>> cleaner.
>>>
>>>
>>> Hi Barry,
>>>
>>> The problem is more than that, the `exclusive || folio_ref_count(folio) == 1`
>>> in do_swap_page is also ineffective now.
>>
>> Exactly.
>>
>> If the roundtrip through do_wp_page() is good enough today, we can just do
>>
>> @@ -4512,7 +4516,6 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>> static inline bool should_try_to_free_swap(struct swap_info_struct *si,
>> struct folio *folio,
>> struct vm_area_struct *vma,
>> - unsigned int extra_refs,
>> unsigned int fault_flags)
>> {
>> if (!folio_test_swapcache(folio))
>> @@ -4528,14 +4531,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
>> if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
>> folio_test_mlocked(folio))
>> return true;
>> - /*
>> - * If we want to map a page that's in the swapcache writable, we
>> - * have to detect via the refcount if we're really the exclusive
>> - * user. Try freeing the swapcache to get rid of the swapcache
>> - * reference only in case it's likely that we'll be the exclusive user.
>> - */
>> - return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
>> - folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
>> + return false;
>> }
>>
>> static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
>> @@ -5095,7 +5091,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> * Do it after mapping, so raced page faults will likely see the folio
>> * in swap cache and wait on the folio lock.
>> */
>> - if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
>> + if (should_try_to_free_swap(si, folio, vma, vmf->flags))
>> folio_free_swap(folio);
>>
>> folio_unlock(folio);
>>
>> But then, one question is whether we'd actually want to try removing the swapcache when
>> we mapped the page writable (iow: exclusive)?
>
> I guess this question comes from my earlier commit c18160dba5ff
> ("mm: swap: reuse exclusive folio directly instead of wp page faults"),
> where the folio is reused even for READ faults. In that case, we
> would miss do_wp_page(), which could later drop the swapcache?
> Also, again, nobody has reported any regression for this.
>
> Holding swapcache for a clean folio for non-sync swap I/O has the
> benefit of avoiding a potential pageout(). Now, reuse even for read
> faults and always dropping swapcache seems to somewhat defeat that
> benefit. On the other hand, we always drop swapcache for sync I/O
> to avoid the copy in zRAM or elsewhere consuming memory, so it seems
> safe enough to always enable reuse in do_swap_page() for sync I/O.

Right, I meant during write faults, when a write is expected. See below.

>
>>
>>
>> @@ -4512,7 +4516,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>> static inline bool should_try_to_free_swap(struct swap_info_struct *si,
>> struct folio *folio,
>> struct vm_area_struct *vma,
>> - unsigned int extra_refs,
>> + bool exclusive,
>> unsigned int fault_flags)
>> {
>> if (!folio_test_swapcache(folio))
>> @@ -4529,13 +4533,11 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
>> folio_test_mlocked(folio))
>> return true;
>> /*
>> - * If we want to map a page that's in the swapcache writable, we
>> - * have to detect via the refcount if we're really the exclusive
>> - * user. Try freeing the swapcache to get rid of the swapcache
>> - * reference only in case it's likely that we'll be the exclusive user.
>> + * We have an exclusive page that was mapped writable or will soon
>> + * be mapped writable (as we are in a write fault). Let's just try
>> + * to reclaim swap immediately.
>> */
>> - return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
>> - folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
>> + return (fault_flags & FAULT_FLAG_WRITE) && exclusive;
>
> I assume you mean (fault_flags & FAULT_FLAG_WRITE) || exclusive, or

No, I meant "we are serving a write fault (write is definitely going to happen)
and we are definitely reusing the page (exclusive).

> we can just remove (fault_flags & FAULT_FLAG_WRITE) and use
> "exclusive" instead since we are always using do_wp_page() now,

If you drop the "FAULT_FLAG_WRITE", you'd remove clean pages (that will likely
stay clean as no write fault) from the swapcache, As you correctly say above,
can avoid a pageout(), so I think we should keep that.

> and FAULT_FLAG_WRITE in fault_flags could have been cleared
> by the reuse of do_swap_page().

Ah, yes, that existing handling is nasty. We should look into not messing with
fault flags. Something like the following

diff --git a/mm/memory.c b/mm/memory.c
index ff338c2abe923..b0d8f3674525b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5052,10 +5052,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
!pte_needs_soft_dirty_wp(vma, pte)) {
pte = pte_mkwrite(pte, vma);
- if (vmf->flags & FAULT_FLAG_WRITE) {
+ if (vmf->flags & FAULT_FLAG_WRITE)
pte = pte_mkdirty(pte);
- vmf->flags &= ~FAULT_FLAG_WRITE;
- }
}
rmap_flags |= RMAP_EXCLUSIVE;
}
@@ -5112,7 +5110,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_put(swapcache);
}

- if (vmf->flags & FAULT_FLAG_WRITE) {
+ if ((vmf->flags & FAULT_FLAG_WRITE) && !pte_write(pte)) {
ret |= do_wp_page(vmf);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;

--
Cheers,

David