Re: [RFC PATCH 3/3] mm: entirely remove lru_add_drain in do_swap_page

From: Barry Song

Date: Thu Jun 18 2026 - 21:57:51 EST

On Thu, Jun 18, 2026 at 3:29 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Wed, Jun 17, 2026 at 5:55 PM Barry Song <baohua@xxxxxxxxxx> wrote:
> >
> > On Wed, Jun 17, 2026 at 1:38 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > >
> > > On Thu, Jun 11, 2026 at 6:52 PM Barry Song (Xiaomi) <baohua@xxxxxxxxxx> wrote:
> > > >
> > > > We are doing a lot of redundant lru_add_drain() calls in
> > > > do_swap_page(), especially for synchronous I/O devices. For
> > > > example, the test program below currently ends up draining
> > > > lru_cache 100% of the time:
> > > >
> > > > int main(int argc, char *argv[])
> > > > {
> > > > int i;
> > > > #define SIZE 100*1024*1024
> > > > while(1) {
> > > > volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
> > > > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > >
> > > > for (int i = 0; i < SIZE/sizeof(int); i++)
> > > > p[i] = i%64;
> > > > madvise((void *)p, SIZE, MADV_PAGEOUT);
> > > > for (int i = 0; i < SIZE/sizeof(int); i++)
> > > > p[i] = i%64;
> > > > munmap(p, SIZE);
> > > > }
> > > > return 0;
> > > > }
> > > >
> > > > Folio reuse now relies primarily on the exclusive hint, making
> > > > lru_cache draining to drop the refcount in lru_cache largely
> > > > irrelevant.
> > > >
> > > > Signed-off-by: Barry Song (Xiaomi) <baohua@xxxxxxxxxx>
> > > > ---
> > > > mm/memory.c | 10 ----------
> > > > 1 file changed, 10 deletions(-)
> > > >
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index ce8ef27e7a54..b5a78670bcc8 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -4903,16 +4903,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > > } else if (folio != swapcache)
> > > > page = folio_page(folio, 0);
> > > >
> > > > - /*
> > > > - * If we want to map a page that's in the swapcache writable, we
> > > > - * have to detect via the refcount if we're really the exclusive
> > > > - * owner. Try removing the extra reference from the local LRU
> > > > - * caches if required.
> > > > - */
> > > > - if ((vmf->flags & FAULT_FLAG_WRITE) &&
> > > > - !folio_test_ksm(folio) && !folio_test_lru(folio))
> > > > - lru_add_drain();
> > > > -
> > > > folio_throttle_swaprate(folio, GFP_KERNEL);
> > > >
> > > > /*
> > > > --
> > > > 2.39.3 (Apple Git-146)
> > > >
> > > >
> > >
> > > There is a ref check in should_try_to_free_swap though, perhaps we
> > > should also improve that part too or we may have more more folios
> > > stuck in swap cache after this change?
> >
> > Good catch. We could assume that a !LRU folio has an extra reference held
> > by lru_cache. Does the below make sense to you?
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index ff338c2abe92..237220ae1572 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5094,8 +5094,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * Remove the swap entry and conditionally try to free up the swapcache.
> > * Do it after mapping, so raced page faults will likely see the folio
> > * in swap cache and wait on the folio lock.
> > + * Assume that lru_cache contributes an extra reference for non-LRU
> > + * folios.
> > */
> > - if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
> > + if (should_try_to_free_swap(si, folio, vma, nr_pages +
> > + !folio_test_lru(folio), vmf->flags))
> > folio_free_swap(folio);
> >
> > folio_unlock(folio);
>
> Right, this seems better. I've been thinking if we can just remove
> that ref check in should_try_to_free_swap and always return true to
> reclaim the cache for write faults. However, that seems like a bad
> idea after all since shared clean swapin folios on non-sync devices
> might benefit significantly from being cached, and such cases are
> unlikely to be very rare.

I don’t quite understand what you mean. Even if we remove the
refcheck folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio))
from should_try_to_free_swap(), folio_maybe_swapped() would still
prevent swapcache from being deleted for shared folios. The only
effect is that we end up doing more work:

rcu_read_lock();
do {
if (__swp_tb_get_count(__swap_table_get(ci, ci_off))) {
ret = true;
break;
}
} while (++ci_off < ci_end);
rcu_read_unlock();

and should_try_to_free_swap() currently checks
fault_flags & FAULT_FLAG_WRITE, so clean folios are not
touched for their swapcache.

The only thing I feel might not be perfect is sync I/O, which I
think should be:

diff --git a/mm/memory.c b/mm/memory.c
index ff338c2abe92..3c231dafefcb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4517,14 +4517,6 @@ static inline bool
should_try_to_free_swap(struct swap_info_struct *si,
{
if (!folio_test_swapcache(folio))
return false;
- /*
- * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
- * cache can help save some IO or memory overhead, but these devices
- * are fast, and meanwhile, swap cache pinning the slot deferring the
- * release of metadata or fragmentation is a more critical issue.
- */
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
- return true;
if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
folio_test_mlocked(folio))
return true;
@@ -4533,8 +4525,12 @@ static inline bool
should_try_to_free_swap(struct swap_info_struct *si,
* have to detect via the refcount if we're really the exclusive
* user. Try freeing the swapcache to get rid of the swapcache
* reference only in case it's likely that we'll be the exclusive user.
+ * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices, even
+ * when swapin is for read, to release memory used for compressed data,
+ * etc.
*/
- return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
+ return ((fault_flags & FAULT_FLAG_WRITE) || data_race(si->flags &
+ SWP_SYNCHRONOUS_IO)) && !folio_test_ksm(folio) &&
folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
}

Not sure if it is worth it. I feel it would only save some
overhead in folio_maybe_swapped() when we have many shared
folios.

>
> Just one thing: I think maybe we can have a more common helper to be
> shared by `wp_can_reuse_anon_folio` and `should_try_to_free_swap`.
> Since the cache reclaim in `should_try_to_free_swap` is meant to help
> `wp_can_reuse_anon_folio` (IIUC), keeping them more consistent seems
> easier to maintain and understand. Trivial though.
>
> It's fine this way too, it's technically correct and this is only a
> heuristic check which doesn't effect correctness.

thanks! I would stick with this approach for now, as
wp_can_reuse_anon_folio() and do_swap_page() seem to differ
significantly in how they handle SYNC I/O, folio locking, etc.

Best Regards
Barry