Re: [PATCH v2] mm/vmscan: batch TLB flush during memory reclaim

From: Vinay Banakar
Date: Fri Apr 04 2025 - 09:37:35 EST

Next message: Steven Rostedt: "Re: [PATCH] tracing: Replace deprecated strncpy() with memcpy() for stack_trace_filter_buf"
Previous message: Halil Pasic: "Re: [PATCH v1] s390/virtio_ccw: don't allocate/assign airqs for non-existing queues"
In reply to: Shakeel Butt: "Re: [PATCH v2] mm/vmscan: batch TLB flush during memory reclaim"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Apr 3, 2025 at 5:00 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> Were any runtime benefits observable?

I had replied as follows on another chain related to this patch:

Yes, the patch reduces IPIs by a factor of 512 by sending one IPI (for TLB
flush) per PMD rather than per page. Since shrink_folio_list()
usually operates on one PMD at a time, I believe we can safely batch
these operations here, but I would appreciate your feedback on this.

Here's a concrete example:
When swapping out 20 GiB (5.2M 4 KiB pages):
- Current: Each page triggers an IPI to all cores
  - With 6 cores: 31.4M total interrupts (6 cores × 5.2M pages)
- With patch: One IPI per PMD (512 pages)
  - Only 10.2K IPIs required (5.2M/512)
  - With 6 cores: 61.4K total interrupts
- Results in a ~99.8% reduction in total interrupts (1 - 1/512)
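
To make the arithmetic above easy to check, here is a small C sketch. It
assumes 4 KiB base pages and 512 PTEs per PMD (the x86-64 defaults), and it
follows the same simple model as the example: total interrupts = flush
rounds × cores. The helper names are made up for illustration and are not
kernel functions.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL /* assumption: 4 KiB base pages */
#define PTES_PER_PMD    512ULL  /* assumption: 512 PTEs per PMD (x86-64) */

/* Number of base pages covering a region of the given size. */
static uint64_t pages_in(uint64_t bytes)
{
    return bytes / PAGE_SIZE_BYTES;
}

/* Flush rounds when every page triggers its own TLB flush. */
static uint64_t flushes_per_page(uint64_t pages)
{
    return pages;
}

/* Flush rounds when one flush covers a whole PMD (rounded up). */
static uint64_t flushes_per_pmd(uint64_t pages)
{
    return (pages + PTES_PER_PMD - 1) / PTES_PER_PMD;
}

/* Simple model from the example: each flush round interrupts every core. */
static uint64_t total_interrupts(uint64_t flushes, unsigned int cores)
{
    return flushes * cores;
}
```

For 20 GiB and 6 cores this gives 5,242,880 pages, 31,457,280 interrupts
with per-page flushing and 61,440 with per-PMD flushing, i.e. a reduction
of 1 - 1/512, or about 99.8%.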

Application performance impact varies by workload, but here's a
representative test case:
- Thread 1: Continuously accesses a 2 GiB private anonymous map (64B
  chunks at random offsets)
- Thread 2: Pinned to a different core, uses MADV_PAGEOUT on a 20 GiB
  private anonymous map to swap it out to SSD
- The threads only access their respective maps.

Results:
- Without patch: Thread 1 sees a ~53% throughput reduction during
  swap-out. With multiple worker threads like Thread 1, the
  cumulative throughput degradation will be much higher
- With patch: Thread 1 maintains normal throughput
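
For anyone who wants to try something similar, the two thread bodies can
be sketched roughly as below. This is a minimal sketch and not the actual
test program: the function names, the xorshift generator and the fill
pattern are my own choices for illustration, and thread creation, CPU
pinning and timing are left out. madvise() with MADV_PAGEOUT is the real
Linux interface (5.4 and later); the fallback define is only for older
libc headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21 /* value from the Linux UAPI headers */
#endif

/* Map a private anonymous region and write to it so it is populated. */
static void *map_anon(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memset(p, 0x5a, len);
    return p;
}

/*
 * Thread 1 body (hypothetical): read 64-byte chunks at pseudo-random
 * offsets. len must be at least 64. The byte sum is returned so the
 * compiler cannot drop the loads.
 */
static uint64_t access_random_chunks(const unsigned char *buf, size_t len,
                                     uint64_t iters)
{
    uint64_t state = 88172645463325252ULL; /* xorshift64 seed */
    uint64_t sum = 0;
    size_t chunks = len / 64;

    for (uint64_t i = 0; i < iters; i++) {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        const unsigned char *chunk = buf + (state % chunks) * 64;
        for (int j = 0; j < 64; j++)
            sum += chunk[j];
    }
    return sum;
}

/*
 * Thread 2 body (hypothetical): ask the kernel to reclaim the region.
 * Returns 0 on success or a negative errno value.
 */
static int page_out(void *buf, size_t len)
{
    return madvise(buf, len, MADV_PAGEOUT) == 0 ? 0 : -errno;
}
```

In the setup described above, Thread 1 would loop over
access_random_chunks() on its 2 GiB map while Thread 2 calls page_out()
on its own 20 GiB map, and throughput is the number of chunk accesses
Thread 1 completes per second.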

On Thu, Apr 3, 2025 at 5:00 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, 28 Mar 2025 14:20:55 -0400 Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> > The current implementation in shrink_folio_list() performs a full TLB
> > flush for every individual folio reclaimed. This causes unnecessary
> > overhead during memory reclaim.
> >
> > The current code:
> > 1. Clears PTEs and unmaps each page individually
> > 2. Performs a full TLB flush on every CPU the mm is running on
> >
> > The new code:
> > 1. Clears PTEs and unmaps each page individually
> > 2. Adds each unmapped page to pageout_folios
> > 3. Flushes the TLB once before processing pageout_folios
> >
> > This reduces the number of TLB flushes issued by the memory reclaim
> > code by a factor of N, where N is the number of mapped folios
> > encountered in the batch processed by shrink_folio_list().
>
> Were any runtime benefits observable?
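
The difference between the two flows in the quoted description can be
illustrated with a userspace toy model that only counts flushes. This is
not the kernel code: struct flush_stats and both functions are
hypothetical names, and a "flush" here is just a counter increment.

```c
#include <stdbool.h>
#include <stddef.h>

/* Counters for one pass over a batch of folios (toy model). */
struct flush_stats {
    size_t unmapped; /* folios whose PTEs were cleared */
    size_t flushes;  /* TLB flush rounds issued */
};

/* Current flow as described: unmap a folio, then flush, every time. */
static struct flush_stats reclaim_unbatched(const bool *mapped, size_t n)
{
    struct flush_stats s = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        if (!mapped[i])
            continue;   /* nothing to unmap, so no flush needed */
        s.unmapped++;   /* step 1: clear PTEs and unmap */
        s.flushes++;    /* step 2: flush immediately */
    }
    return s;
}

/* New flow as described: unmap and queue, then flush once at the end. */
static struct flush_stats reclaim_batched(const bool *mapped, size_t n)
{
    struct flush_stats s = { 0, 0 };
    size_t queued = 0;  /* stands in for the pageout_folios list */

    for (size_t i = 0; i < n; i++) {
        if (!mapped[i])
            continue;
        s.unmapped++;   /* step 1: clear PTEs and unmap */
        queued++;       /* step 2: add to the pageout list */
    }
    if (queued > 0)
        s.flushes = 1;  /* step 3: one flush before pageout */
    return s;
}
```

With N mapped folios in a batch, the first function reports N flushes and
the second reports one, which is the factor-of-N reduction the
description refers to.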
