Re: [PATCH 0/9] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE

From: Lorenzo Stoakes
Date: Tue Mar 11 2025 - 08:50:08 EST


The below commit message talks about tlb bits, but you spend a lot of this
series refactoring mm/madvise.c.

Can we just separate out the two into separate series please?

I am doing the same kind of thing with mremap() at the moment, but sent the
refactor _first_ before the work that builds upon it :)

Your refactoring is great, so I want to be able to take that (and Andrew's
objections obviously don't apply there), and then we can address tlb stuff
separately and in a more focused way.

On Mon, Mar 10, 2025 at 10:23:09AM -0700, SeongJae Park wrote:
> When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
> MADV_FREE with multiple address ranges, tlb flushes happen for each of
> the given address ranges. Because such tlb flushes are for the same
> process, doing those in a batch is more efficient while still being
> safe. Modify the process_madvise() entry level code path to do such
> batched tlb flushes, while the internal unmap logic does only the
> gathering of the tlb entries to flush.
>
> In more detail, modify the entry functions to initialize an mmu_gather
> object and pass it to the internal logic, and make the internal logic
> do only the gathering of the tlb entries to flush into the received
> mmu_gather object. After all internal function calls are done, the
> entry functions flush the gathered tlb entries at once.
>
> The inefficiency should be smaller in the madvise() use case, since it
> receives only a single address range. But if there are multiple vmas
> for the range, the same problem can happen. It is unclear if such a
> use case is common and the inefficiency is significant. Make the
> change for madvise(), too, since it doesn't really change madvise()
> internal behavior while it helps keep the code shared between
> process_madvise() and madvise() internal logic clean.
>
> Patch Sequence
> ==============
>
> The first four patches are minor cleanups of madvise.c for readability.
>
> The fifth patch defines a new data structure for managing the
> information required for batched tlb flushes (mmu_gather and behavior),
> and updates the MADV_DONTNEED[_LOCKED] and MADV_FREE handling internal
> logic code paths to receive it.
>
> The sixth and seventh patches make the internal logic for handling
> MADV_DONTNEED[_LOCKED] and MADV_FREE ready for batched tlb flushing.
> The patches keep support for the unbatched tlb flushes use case, for
> fine-grained and safe transitions.
>
> The eighth patch updates madvise() and process_madvise() code to do the
> batched tlb flushes, utilizing the changes introduced by the previous
> patches.
>
> The final, ninth patch removes the internal logic's support code for
> the unbatched tlb flushes use case, which is no longer used.
>
> Test Results
> ============
>
> I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory
> using multiple process_madvise() calls. I apply the advice at 4 KiB
> region granularity, but with varying batch sizes per process_madvise()
> call (vlen) from 1 to 1024. The source code for the measurement is
> available at GitHub[1]. To reduce measurement errors, I did the
> measurement five times.
>
> The measurement results are as below. The 'sz_batch' column shows the
> batch size of the process_madvise() calls. The 'Before' and 'After'
> columns show the average of the latencies in nanoseconds, measured five
> times on kernels built without and with the tlb flushes batching
> patches of this series, respectively. For the baseline, the
> mm-unstable tree of 2025-03-07[2] has been used. The 'B-stdev' and
> 'A-stdev' columns show the ratio of the standard deviation of the
> latency measurements to the average, in percent, for 'Before' and
> 'After', respectively. 'Latency_reduction' shows the reduction of the
> latency that the series has achieved, in percent. Higher
> 'Latency_reduction' values mean more efficiency improvements.
>
> sz_batch  Before       B-stdev  After        A-stdev  Latency_reduction
> 1         128691595.4  6.09     106878698.4  2.76     16.95
> 2         94046750.8   3.30     68691907     2.79     26.96
> 4         80538496.8   5.35     50230673.8   5.30     37.63
> 8         72672225.2   5.50     43918112     3.54     39.57
> 16        66955104.4   4.25     36549941.2   1.62     45.41
> 32        65814679     5.89     33060291     3.30     49.77
> 64        65099205.2   2.69     26003756.4   1.56     60.06
> 128       62591307.2   4.02     24008180.4   1.61     61.64
> 256       64124368.6   2.93     23868856     2.20     62.78
> 512       62325618     5.72     23311330.6   1.74     62.60
> 1024      64802138.4   5.05     23651695.2   3.40     63.50
>
> Interestingly, the latency has reduced (improved) even with batch size
> 1. I suspect some compiler optimizations have affected that, as also
> observed with the previous process_madvise() mmap_lock optimization
> patch series[3].
>
> So, let's focus on the proportion between the improvement and the batch
> size. As expected, tlb flushes batching provides a latency reduction
> proportional to the batch size. The efficiency gain ranges from about
> 27 percent with batch size 2 up to 63 percent with batch size 1,024.
>
> Please note that this is a very simple microbenchmark, so the real
> efficiency gain on real workloads could be very different.
>
> Changes from RFC
> (https://lore.kernel.org/20250305181611.54484-1-sj@xxxxxxxxxx)
> - Clarify motivation of the change on the cover letter
> - Add average and stdev of evaluation results
> - Show latency reduction on evaluation results
> - Fix !CONFIG_MEMORY_FAILURE build error
> - Rename is_memory_populate() to is_madvise_populate()
> - Squash patches 5-8
> - Add kerneldoc for unmap_vm_area_struct()
> - Squash patches 10 and 11
> - Squash patches 12-14
> - Squash patches 15 and 16
>
> References
> ==========
>
> [1] https://github.com/sjp38/eval_proc_madvise
> [2] commit e664d7d28a7c ("selftest: test system mappings are sealed") # mm-unstable
> [3] https://lore.kernel.org/20250211182833.4193-1-sj@xxxxxxxxxx
>
> SeongJae Park (9):
> mm/madvise: use is_memory_failure() from madvise_do_behavior()
> mm/madvise: split out populate behavior check logic
> mm/madvise: deduplicate madvise_do_behavior() skip case handlings
> mm/madvise: remove len parameter of madvise_do_behavior()
> mm/madvise: define and use madvise_behavior struct for
> madvise_do_behavior()
> mm/memory: split non-tlb flushing part from zap_page_range_single()
> mm/madvise: let madvise_{dontneed,free}_single_vma() caller batches
> tlb flushes
> mm/madvise: batch tlb flushes for
> [process_]madvise(MADV_{DONTNEED[_LOCKED],FREE})
> mm/madvise: remove !tlb support from
> madvise_{dontneed,free}_single_vma()
>
> mm/internal.h | 3 +
> mm/madvise.c | 221 +++++++++++++++++++++++++++++++++-----------------
> mm/memory.c | 38 ++++++---
> 3 files changed, 176 insertions(+), 86 deletions(-)
>
>
> base-commit: e993f5f5b0ac851cf60578cfee5488031dfaa80c
> --
> 2.39.5