Re: [PATCH v8 3/3] mm/madvise: optimize lazyfreeing with mTHP in madvise_free

From: David Hildenbrand
Date: Wed Apr 17 2024 - 13:06:26 EST


On 17.04.24 16:14, Lance Yang wrote:
This patch optimizes lazyfreeing with PTE-mapped mTHP[1]
(Inspired by David Hildenbrand[2]). We aim to avoid unnecessary folio
splitting if the large folio is fully mapped within the target range.

If a large folio is locked or shared, or if we fail to split it, we just
leave it in place and advance to the next PTE in the range. But note that
the behavior is changed; previously, any failure of this sort would cause
the entire operation to give up. As large folios become more common,
sticking to the old way could result in wasted opportunities.

On an Intel i5 CPU, lazyfreeing a 1GiB VMA backed by PTE-mapped folios of
the same size results in the following runtimes for madvise(MADV_FREE) in
seconds (shorter is better):

Folio Size |   Old    |   New    | Change
------------------------------------------
      4KiB | 0.590251 | 0.590259 |     0%
     16KiB | 2.990447 | 0.185655 |   -94%
     32KiB | 2.547831 | 0.104870 |   -95%
     64KiB | 2.457796 | 0.052812 |   -97%
    128KiB | 2.281034 | 0.032777 |   -99%
    256KiB | 2.230387 | 0.017496 |   -99%
    512KiB | 2.189106 | 0.010781 |   -99%
   1024KiB | 2.183949 | 0.007753 |   -99%
   2048KiB | 0.002799 | 0.002804 |     0%

[1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@xxxxxxx
[2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@xxxxxxxxxx

Signed-off-by: Lance Yang <ioworker0@xxxxxxxxx>

Some of the changes could have been moved into separate patches to ease review ;)

At least the folio_pte_batch() change and factoring out some stuff from madvise_cold_or_pageout_pte_range(). But see below on the latter.

---
mm/internal.h | 12 ++++-
mm/madvise.c | 141 ++++++++++++++++++++++++++++----------------------
mm/memory.c | 4 +-
3 files changed, 91 insertions(+), 66 deletions(-)

[...]

diff --git a/mm/madvise.c b/mm/madvise.c
index f5e3699e7b54..d6f1889d6308 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -321,6 +321,39 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
file_permission(vma->vm_file, MAY_WRITE) == 0;
}
+static inline int madvise_folio_pte_batch(unsigned long addr, unsigned long end,
+ struct folio *folio, pte_t *ptep,
+ pte_t pte, bool *any_young,
+ bool *any_dirty)
+{
+ int max_nr = (end - addr) / PAGE_SIZE;
+ const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;

Reverse Christmas tree looks nicer ;)
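
For readers unfamiliar with the term: "reverse Christmas tree" orders local declarations from the longest line to the shortest. A minimal userspace sketch of the reordering follows. PAGE_SIZE, fpb_t and the FPB_* values are stand-ins defined here so the fragment compiles on its own, and stub_folio_pte_batch() and madvise_batch_sketch() are hypothetical names, not kernel functions.

```c
/* Stand-ins for kernel definitions; the values are illustrative only. */
#define PAGE_SIZE 4096UL
typedef unsigned int fpb_t;
#define FPB_IGNORE_DIRTY	((fpb_t)1u << 0)
#define FPB_IGNORE_SOFT_DIRTY	((fpb_t)1u << 1)

/* Hypothetical stub in place of folio_pte_batch(): returns max_nr as-is. */
static int stub_folio_pte_batch(int max_nr, fpb_t flags)
{
	(void)flags;
	return max_nr;
}

static int madvise_batch_sketch(unsigned long addr, unsigned long end)
{
	/* Reverse Christmas tree: longest declaration first, shortest last. */
	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
	int max_nr = (end - addr) / PAGE_SIZE;

	return stub_folio_pte_batch(max_nr, fpb_flags);
}
```

Only the declaration order changes; the computed values are the same as in the posted hunk.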

+
+ return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
+ any_young, any_dirty);
+}
+
+static inline bool madvise_pte_split_folio(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr,
+ struct folio *folio, pte_t **pte,
+ spinlock_t **ptl)
+{
+ int err;
+
+ if (!folio_trylock(folio))
+ return false;
+
+ folio_get(folio);
+ pte_unmap_unlock(*pte, *ptl);
+ err = split_folio(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+
+ *pte = pte_offset_map_lock(mm, pmd, addr, ptl);

Staring at this helper again, I am really not sure if we should have it. Calling semantics are "special" and that pte_t **pte is just ... "special" as well ;)

Can we just leave that part as is, in the caller? That would also mean less madvise_cold_or_pageout_pte_range() churn ... which I would welcome as part of this patch.

[...]

@@ -741,19 +767,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
if (pte_young(ptent) || pte_dirty(ptent)) {
- /*
- * Some of architecture(ex, PPC) don't update TLB
- * with set_pte_at and tlb_remove_tlb_entry so for
- * the portability, remap the pte with old|clean
- * after pte clearing.
- */
- ptent = ptep_get_and_clear_full(mm, addr, pte,
- tlb->fullmm);
-
- ptent = pte_mkold(ptent);
- ptent = pte_mkclean(ptent);
- set_pte_at(mm, addr, pte, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
+ clear_young_dirty_ptes(vma, addr, pte, nr,
+ CYDP_CLEAR_YOUNG |
+ CYDP_CLEAR_DIRTY);

That indent looks odd. I suggest simply having a local variable

const cydp_t cydp_flags = CYDP_CLEAR_YOUNG | CYDP_CLEAR_DIRTY;

and then using cydp_flags here; that will make this easier to read.
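
Roughly what I have in mind, as a userspace sketch rather than the actual kernel change. cydp_t and the CYDP_* values are stand-ins defined here, struct mock_pte and mock_clear_young_dirty_ptes() are hypothetical mocks of the PTE state and of clear_young_dirty_ptes(), and lazyfree_batch_sketch() is a made-up caller.

```c
#include <stdbool.h>

/* Stand-ins for kernel definitions; the values are illustrative only. */
typedef unsigned int cydp_t;
#define CYDP_CLEAR_YOUNG	((cydp_t)1u << 0)
#define CYDP_CLEAR_DIRTY	((cydp_t)1u << 1)

/* Hypothetical mock of a PTE: tracks only the young and dirty bits. */
struct mock_pte {
	bool young;
	bool dirty;
};

/* Hypothetical mock of clear_young_dirty_ptes() over nr entries. */
static void mock_clear_young_dirty_ptes(struct mock_pte *pte, unsigned int nr,
					cydp_t flags)
{
	for (unsigned int i = 0; i < nr; i++) {
		if (flags & CYDP_CLEAR_YOUNG)
			pte[i].young = false;
		if (flags & CYDP_CLEAR_DIRTY)
			pte[i].dirty = false;
	}
}

static void lazyfree_batch_sketch(struct mock_pte *pte, unsigned int nr)
{
	/* Naming the flags once keeps the call on a single line. */
	const cydp_t cydp_flags = CYDP_CLEAR_YOUNG | CYDP_CLEAR_DIRTY;

	mock_clear_young_dirty_ptes(pte, nr, cydp_flags);
}
```

With the flags in a local, the call no longer needs the awkward three-line continuation.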

--
Cheers,

David / dhildenb