[RFC PATCH 00/16] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE

From: SeongJae Park
Date: Wed Mar 05 2025 - 13:17:57 EST

For MADV_DONTNEED[_LOCKED] or MADV_FREE madvise requests, tlb flushes
can happen for each vma of the given address ranges. Because such tlb
flushes are for address ranges of the same process, doing them in a
single batch is more efficient while still being safe. Modify the
madvise() and process_madvise() entry-level code paths to do such
batched tlb flushes, while the internal unmap logic only gathers the
tlb entries to flush.

In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logic. Also modify the internal
logic to only gather the tlb entries to flush into the received
mmu_gather object. After all internal function calls are done, the
entry functions finish the mmu_gather object to flush the gathered tlb
entries in one batch.
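
The flow above can be sketched as a small user-space model. This is
only an illustration, not the code of this series: toy_gather,
toy_range, and the toy_*() functions are hypothetical stand-ins for the
kernel's struct mmu_gather and the madvise internals, and a "flush" here
is just a counter increment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct mmu_gather. */
struct toy_gather {
	size_t nr_pending;	/* ranges gathered but not yet flushed */
	size_t nr_flushes;	/* flush operations issued so far */
};

/* Hypothetical stand-in for the address range covered by one vma. */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

void toy_gather_init(struct toy_gather *tlb)
{
	tlb->nr_pending = 0;
	tlb->nr_flushes = 0;
}

/* Internal logic: only record the range, never flush here. */
void toy_unmap_range(struct toy_gather *tlb, const struct toy_range *r)
{
	if (r->end > r->start)
		tlb->nr_pending++;
}

/* Finish: a single flush covers everything gathered so far. */
void toy_gather_finish(struct toy_gather *tlb)
{
	if (tlb->nr_pending) {
		tlb->nr_flushes++;
		tlb->nr_pending = 0;
	}
}

/* Entry function: init, let the internals gather, then flush once. */
size_t toy_madvise_batched(const struct toy_range *ranges, size_t nr)
{
	struct toy_gather tlb;
	size_t i;

	assert(ranges || !nr);
	toy_gather_init(&tlb);
	for (i = 0; i < nr; i++)
		toy_unmap_range(&tlb, &ranges[i]);
	toy_gather_finish(&tlb);
	return tlb.nr_flushes;
}

/* For comparison: flush after every range, as in the unbatched case. */
size_t toy_madvise_unbatched(const struct toy_range *ranges, size_t nr)
{
	struct toy_gather tlb;
	size_t i;

	assert(ranges || !nr);
	toy_gather_init(&tlb);
	for (i = 0; i < nr; i++) {
		toy_unmap_range(&tlb, &ranges[i]);
		toy_gather_finish(&tlb);
	}
	return tlb.nr_flushes;
}
```

In this model, three non-empty ranges cost three flushes in the
unbatched variant and one flush in the batched variant.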

Patches Sequence
================

The first four patches are minor cleanups of madvise.c for readability.

The following four patches (patches 5-8) define a new data structure
for managing the information required for batched tlb flushing
(mmu_gather and behavior), and update the code paths of the internal
MADV_DONTNEED[_LOCKED] and MADV_FREE handling logic to receive it.

The next three patches (patches 9-11) make the internal
MADV_DONTNEED[_LOCKED] and MADV_FREE handling logic ready for batched
tlb flushing. They keep support for the unbatched tlb flushing use
case, so that the transition stays fine-grained and safe.
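
One way to picture such a transitional step is a function that takes an
optional gather object from its caller. The sketch below is a
user-space model under that assumption, not the code of patches 9-11:
opt_gather, opt_flush(), and opt_zap_range() are hypothetical names, and
a NULL argument stands for a caller that has not been converted to
batching yet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct mmu_gather. */
struct opt_gather {
	size_t nr_pending;	/* ranges gathered but not yet flushed */
	size_t nr_flushes;	/* flush operations issued so far */
};

/* Flush everything gathered so far with a single (counted) flush. */
void opt_flush(struct opt_gather *tlb)
{
	if (tlb->nr_pending) {
		tlb->nr_flushes++;
		tlb->nr_pending = 0;
	}
}

/*
 * Gather one range.  If the caller passed its own gather object, leave
 * the flush to the caller (batched case).  If it passed NULL, use a
 * local gather object and flush right away (unbatched case).
 *
 * Returns the number of flushes done inside this call.
 */
size_t opt_zap_range(struct opt_gather *caller_tlb,
		     unsigned long start, unsigned long end)
{
	struct opt_gather local = { 0, 0 };
	struct opt_gather *tlb = caller_tlb ? caller_tlb : &local;

	assert(end >= start);
	if (end > start)
		tlb->nr_pending++;

	if (caller_tlb)
		return 0;	/* the caller flushes later, in one batch */

	opt_flush(&local);
	return local.nr_flushes;
}
```

Callers can then be converted one at a time from passing NULL to
passing their own gather object. Once no caller passes NULL, the local
fallback becomes dead code and can be removed.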

The next three patches (patches 12-14) update the madvise() and
process_madvise() code to do the batched tlb flushes, using the changes
introduced by the previous patches.

The final two patches (patches 15-16) clean up the internal logic's
support code for the unbatched tlb flushing use case, which is no
longer used.

Test Results
============

I measured the time to apply MADV_DONTNEED advice to 256 MiB of memory
using multiple process_madvise() calls. I apply the advice at 4 KiB
region granularity, but with the batch size (vlen) varying from 1 to
1024. The source code for the measurement is available on GitHub[1].
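
The shape of such a measurement loop can be sketched as below. This is
not the benchmark at [1], only a minimal illustration: fill_batch() and
dontneed_in_batches() are hypothetical helpers, timing code is left out,
and the fallback syscall number 440 is an assumption that holds on most
but not all architectures.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SYS_process_madvise
#define SYS_process_madvise 440	/* assumption: generic syscall number */
#endif

#define MAX_BATCH 1024

/*
 * Fill iov[] with up to 'batch' regions of 'region' bytes each, starting
 * at buf + off.  Only whole regions inside the buffer are used.
 * Returns the number of entries filled.
 */
size_t fill_batch(struct iovec *iov, size_t batch, char *buf, size_t len,
		  size_t off, size_t region)
{
	size_t n = 0;

	assert(region > 0);
	while (n < batch && off + region <= len) {
		iov[n].iov_base = buf + off;
		iov[n].iov_len = region;
		off += region;
		n++;
	}
	return n;
}

/*
 * Apply MADV_DONTNEED to buf in 'region'-sized pieces, passing 'batch'
 * pieces to each process_madvise() call.  A trailing partial region is
 * skipped.  Returns the number of calls issued, or -1 on failure.
 */
long dontneed_in_batches(int pidfd, char *buf, size_t len, size_t region,
			 size_t batch)
{
	struct iovec iov[MAX_BATCH];
	size_t off = 0;
	long calls = 0;

	assert(batch > 0 && batch <= MAX_BATCH);
	while (off < len) {
		size_t n = fill_batch(iov, batch, buf, len, off, region);

		if (!n)
			break;
		if (syscall(SYS_process_madvise, pidfd, iov, n,
			    MADV_DONTNEED, 0) < 0)
			return -1;
		off += n * region;
		calls++;
	}
	return calls;
}
```

A caller would map a 256 MiB buffer, get a pidfd for its own process
(for example via the pidfd_open() system call), and time
dontneed_in_batches() for each batch size. With 4 KiB regions and a
batch size of 1024, the loop issues 64 calls. Note that the kernel has
to be recent enough to accept MADV_DONTNEED via process_madvise().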

The measurement results are as below. The 'sz_batches' column shows the
batch size of the process_madvise() calls. The 'before' and 'after'
columns are the measured times to apply MADV_DONTNEED to the 256 MiB
memory buffer in nanoseconds, on kernels built without and with the
MADV_DONTNEED tlb flushes batching patch of this series, respectively.
For the baseline, the mm-unstable tree of 2025-03-04[2] has been used.
The 'after/before' column is the ratio of 'after' to 'before'. So an
'after/before' value lower than 1.0 means this patch increased
efficiency over the baseline, and a lower value means better
efficiency.

sz_batches  before     after      after/before
1           102842895  106507398  1.03563204828102
2           73364942   74529223   1.01586971880929
4           58823633   51608504   0.877343022998937
8           47532390   44820223   0.942940655834895
16          43591587   36727177   0.842529018271347
32          44207282   33946975   0.767904595446515
64          41832437   26738286   0.639175910310939
128         40278193   23262940   0.577556694263817
256         41568533   22355103   0.537789077136785
512         41626638   22822516   0.54826709762148
1024        44440870   22676017   0.510251419470411

For batch sizes of 2 or less, tlb flushes batching shows no big
difference, only a slight overhead. I think that's within the error
range of this simple micro-benchmark, and therefore can be ignored.
Starting from batch size 4, however, tlb flushes batching shows a clear
efficiency gain. The efficiency gain tends to grow with the batch size,
as expected. It ranges from about 12 percent with batch size 4 up to
about 49 percent with batch size 1024.
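
The arithmetic behind the 'after/before' column and the quoted gains is
simple enough to spell out. The helper names below are hypothetical and
are not part of the benchmark at [1].

```c
#include <assert.h>

/* Ratio of the patched time to the baseline time; below 1.0 is a win. */
double after_before_ratio(unsigned long long before_ns,
			  unsigned long long after_ns)
{
	assert(before_ns > 0);
	return (double)after_ns / (double)before_ns;
}

/* Efficiency gain in percent; negative values mean a slowdown. */
double gain_percent(unsigned long long before_ns,
		    unsigned long long after_ns)
{
	return (1.0 - after_before_ratio(before_ns, after_ns)) * 100.0;
}
```

For the batch size 1024 row, gain_percent(44440870, 22676017) comes out
at just under 49 percent. For the batch size 1 row the result is
negative, matching the slight overhead noted above.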

Please note that this is a very simple microbenchmark, so the
efficiency gain on real workloads could be very different.

References
==========

[1] https://github.com/sjp38/eval_proc_madvise
[2] commit 7b6c5895bb9a ("mm: hugetlb: log time needed to allocate hugepages") # mm-unstable

SeongJae Park (16):
  mm/madvise: use is_memory_failure() from madvise_do_behavior()
  mm/madvise: split out populate behavior check logic
  mm/madvise: deduplicate madvise_do_behavior() skip case handlings
  mm/madvise: remove len parameter of madvise_do_behavior()
  mm/madvise: define and use madvise_behavior struct for
    madvise_do_behavior()
  mm/madvise: pass madvise_behavior struct to madvise_vma_behavior()
  mm/madvise: make madvise_walk_vmas() visit function receives a void
    pointer
  mm/madvise: pass madvise_behavior struct to madvise_dontneed_free()
  mm/memory: split non-tlb flushing part from zap_page_range_single()
  mm/madvise: let madvise_dontneed_single_vma() caller batches tlb
    flushes
  mm/madvise: let madvise_free_single_vma() caller batches tlb flushes
  mm/madvise: batch tlb flushes for
    process_madvise(MADV_DONTNEED[_LOCKED])
  mm/madvise: batch tlb flushes for process_madvise(MADV_FREE)
  mm/madvise: batch tlb flushes for
    madvise(MADV_{DONTNEED[_LOCKED],FREE}
  mm/madvise: remove !tlb support from madvise_dontneed_single_vma()
  mm/madvise: remove !caller_tlb case of madvise_free_single_vma()

mm/internal.h | 3 +
mm/madvise.c | 176 ++++++++++++++++++++++++++++++++++----------------
mm/memory.c | 36 +++++++----
3 files changed, 144 insertions(+), 71 deletions(-)

base-commit: f653b037b4a6d83c68098fc3949090dfb63316fb
--
2.39.5
