Re: [PATCH] mm/swap_state: remove unnecessary lru_add_drain() from readahead

From: Barry Song

Date: Tue Jun 09 2026 - 06:05:29 EST

On Tue, Jun 9, 2026 at 5:30 PM Usama Arif <usama.arif@xxxxxxxxx> wrote:
>
>
>
> On 09/06/2026 09:01, Barry Song wrote:
> > On Mon, Jun 8, 2026 at 10:33 PM Usama Arif <usama.arif@xxxxxxxxx> wrote:
> >>
> >> swap_cluster_readahead() and swap_vma_readahead() end the readahead
> >> loop with an explicit lru_add_drain() call. That drain is a leftover
> >> from 2.6.12 era code and serves no functional purpose for the callers:
> >>
> >> - do_swap_page() ignores LRU residency for the readahead folios;
> >> it only needs the target folio it called swapin_readahead() for,
> >> and if the write-fault path needs the target folio on the LRU to count
> >> references accurately, it runs its own lru_add_drain() at the
> >> wp_can_reuse_anon_folio() and do_swap_page() sites.
> >
> > Right. I can see the following in do_swap_page():
> >
> >         /*
> >          * If we want to map a page that's in the swapcache writable, we
> >          * have to detect via the refcount if we're really the exclusive
> >          * owner. Try removing the extra reference from the local LRU
> >          * caches if required.
> >          */
> >         if ((vmf->flags & FAULT_FLAG_WRITE) &&
> >             !folio_test_ksm(folio) && !folio_test_lru(folio))
> >                 lru_add_drain();
> >
> > and the below in wp_can_reuse_anon_folio():
> >
> >         if (!folio_test_lru(folio))
> >                 /*
> >                  * We cannot easily detect+handle references from
> >                  * remote LRU caches or references to LRU folios.
> >                  */
> >                 lru_add_drain();
> >
> >>
> >> - shmem_swapin_cluster() immediately locks the returned folio, waits
> >> for writeback, then operates on it - LRU residency of either the target
> >> or the readahead folios is irrelevant.
> >>
> >> - try_to_unuse() likewise locks the folio and calls unuse_pte() without
> >> depending on LRU presence.
> >>
> >> Folios newly added to the swap cache by the readahead loop sit in
> >> the per-CPU LRU folio_batch and will be drained naturally as the
> >> batch fills (FOLIO_BATCH_SIZE), by the next reclaim/compaction
> >> lru_add_drain_all(), and so on. The unconditional drain only
> >> synchronously flushes a partial batch and forces contention on
> >> lruvec_lock.
> >>
> >> On a 176-CPU production host running a memory-pressured workload, this
> >> path was observed to call folio_batch_move_lru() from
> >> swap_cluster_readahead() ~28K/min, a very large source of LRU lock
> >> traffic.
> >>
> >
> > Do we see a workload improvement? If yes, can we put the data?
> >
>
> Hello Barry!
>
> So LRU lock contention is a source of issues in the Meta fleet.
>
> This problem was specifically seen when I ran `perf lock contention -a -b`
> in production on a workload that has a really big anon heap and heavy swap
> activity.
>
> When I tried to trace with bpftrace who was the biggest consumer, it was
> readahead.
>
> It is easy to run perf and bpftrace on prod on this specific workload, but
> more difficult to flash a new kernel and see results. The easiest would be
> to wait for a kernel upgrade that includes this patch, compare the
> results, and report back.

Yes, it seems fairly straightforward to reduce LRU lock contention.
From the review, the patch looks good. I’m just not certain whether
we might be missing any corner cases.
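
For intuition on why dropping the per-readahead drain should reduce
lock traffic, here is a toy userspace model. It is only a sketch and
not kernel code: the toy_* names are hypothetical, BATCH_SIZE = 31 is
an assumed stand-in for FOLIO_BATCH_SIZE, and the model ignores every
other drain trigger (reclaim, compaction, lru_add_drain_all(), large
folios and so on). It only counts how often a modelled LRU lock would
be taken with and without a flush after each readahead loop.

```c
#include <stddef.h>

/* Assumed batch capacity, standing in for FOLIO_BATCH_SIZE. */
#define BATCH_SIZE 31

/* Hypothetical model of one CPU's pending-LRU batch. */
struct toy_batch {
	unsigned int nr;          /* folios currently queued */
	unsigned long lock_takes; /* times the modelled LRU lock was taken */
	unsigned long moved;      /* folios moved onto the modelled LRU */
};

/* Flush the batch; an empty batch costs no lock acquisition. */
static void toy_drain(struct toy_batch *b)
{
	if (b->nr == 0)
		return;
	b->lock_takes++;
	b->moved += b->nr;
	b->nr = 0;
}

/* Queue one folio; flush only once the batch is full. */
static void toy_add(struct toy_batch *b)
{
	b->nr++;
	if (b->nr == BATCH_SIZE)
		toy_drain(b);
}

/*
 * Simulate `faults` swap-in faults, each queueing `window` readahead
 * folios. With drain_each != 0, flush after every readahead loop.
 * Returns the number of modelled lock acquisitions.
 */
static unsigned long toy_simulate(unsigned int faults, unsigned int window,
				  int drain_each)
{
	struct toy_batch b = { 0 };
	unsigned int i, j;

	for (i = 0; i < faults; i++) {
		for (j = 0; j < window; j++)
			toy_add(&b);
		if (drain_each)
			toy_drain(&b);
	}
	return b.lock_takes;
}
```

Under these assumptions, toy_simulate(100, 8, 1) takes the modelled
lock 100 times, while toy_simulate(100, 8, 0) takes it 25 times, since
the batch is flushed only when it fills.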

I would be happy to see it queued for testing once the mm tree is
ready to accept new patches. Is it too late at the moment?

Feel free to add:

Reviewed-by: Barry Song <baohua@xxxxxxxxxx>

Thanks,
Barry