Re: [PATCH v12 1/8] mm: support madvise(MADV_FREE)

From: Minchan Kim
Date: Thu Jul 17 2014 - 22:22:06 EST


On Thu, Jul 17, 2014 at 02:28:32PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jul 09, 2014 at 03:22:22PM +0900, Minchan Kim wrote:
> > Linux doesn't have an ability to free pages lazy while other OS
> > already have been supported that named by madvise(MADV_FREE).
> >
> > The gain is clear that kernel can discard freed pages rather than
> > swapping out or OOM if memory pressure happens.
> >
> > Without memory pressure, freed pages would be reused by userspace
> > without another additional overhead(ex, page fault + allocation
> > + zeroing).
> >
> > How to work is following as.
> >
> > When madvise syscall is called, VM clears dirty bit of ptes of
> > the range. If memory pressure happens, VM checks dirty bit of
> > page table and if it found still "clean", it means it's a
> > "lazyfree pages" so VM could discard the page instead of swapping out.
> > Once there was store operation for the page before VM peek a page
> > to reclaim, dirty bit is set so VM can swap out the page instead of
> > discarding.
> >
> > Firstly, heavy users would be general allocators(ex, jemalloc,
> > tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> > have supported the feature for other OS(ex, FreeBSD)
> >
> > barrios@blaptop:~/benchmark/ebizzy$ lscpu
> > Architecture: x86_64
> > CPU op-mode(s): 32-bit, 64-bit
> > Byte Order: Little Endian
> > CPU(s): 4
> > On-line CPU(s) list: 0-3
> > Thread(s) per core: 2
> > Core(s) per socket: 2
> > Socket(s): 1
> > NUMA node(s): 1
> > Vendor ID: GenuineIntel
> > CPU family: 6
> > Model: 42
> > Stepping: 7
> > CPU MHz: 2801.000
> > BogoMIPS: 5581.64
> > Virtualization: VT-x
> > L1d cache: 32K
> > L1i cache: 32K
> > L2 cache: 256K
> > L3 cache: 4096K
> > NUMA node0 CPU(s): 0-3
> >
> > ebizzy benchmark(./ebizzy -S 10 -n 512)
> >
> > vanilla-jemalloc MADV_free-jemalloc
> >
> > 1 thread
> > records: 10 records: 10
> > avg: 7682.10 avg: 15306.10
> > std: 62.35(0.81%) std: 347.99(2.27%)
> > max: 7770.00 max: 15622.00
> > min: 7598.00 min: 14772.00
> >
> > 2 thread
> > records: 10 records: 10
> > avg: 12747.50 avg: 24171.00
> > std: 792.06(6.21%) std: 895.18(3.70%)
> > max: 13337.00 max: 26023.00
> > min: 10535.00 min: 23152.00
> >
> > 4 thread
> > records: 10 records: 10
> > avg: 16474.60 avg: 33717.90
> > std: 1496.45(9.08%) std: 2008.97(5.96%)
> > max: 17877.00 max: 35958.00
> > min: 12224.00 min: 29565.00
> >
> > 8 thread
> > records: 10 records: 10
> > avg: 16778.50 avg: 33308.10
> > std: 825.53(4.92%) std: 1668.30(5.01%)
> > max: 17543.00 max: 36010.00
> > min: 14576.00 min: 29577.00
> >
> > 16 thread
> > records: 10 records: 10
> > avg: 20614.40 avg: 35516.30
> > std: 602.95(2.92%) std: 1283.65(3.61%)
> > max: 21753.00 max: 37178.00
> > min: 19605.00 min: 33217.00
> >
> > 32 thread
> > records: 10 records: 10
> > avg: 22771.70 avg: 36018.50
> > std: 598.94(2.63%) std: 1046.76(2.91%)
> > max: 24035.00 max: 37266.00
> > min: 22108.00 min: 34149.00
> >
> > In summary, MADV_FREE is about 2 time faster than MADV_DONTNEED.
> >
> > Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> > Cc: Linux API <linux-api@xxxxxxxxxxxxxxx>
> > Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
> > Cc: Mel Gorman <mgorman@xxxxxxx>
> > Cc: Jason Evans <je@xxxxxx>
> > Acked-by: Zhang Yanfei <zhangyanfei@xxxxxxxxxxxxxx>
> > Acked-by: Rik van Riel <riel@xxxxxxxxxx>
> > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > ---
> > include/linux/rmap.h | 9 ++-
> > include/linux/vm_event_item.h | 1 +
> > include/uapi/asm-generic/mman-common.h | 1 +
> > mm/madvise.c | 136 +++++++++++++++++++++++++++++++++
> > mm/rmap.c | 42 +++++++++-
> > mm/vmscan.c | 40 ++++++++--
> > mm/vmstat.c | 1 +
> > 7 files changed, 218 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index be574506e6a9..0ba377b97a38 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -75,6 +75,7 @@ enum ttu_flags {
> > TTU_UNMAP = 1, /* unmap mode */
> > TTU_MIGRATION = 2, /* migration mode */
> > TTU_MUNLOCK = 4, /* munlock mode */
> > + TTU_FREE = 8, /* free mode */
> >
> > TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
> > TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
> > @@ -181,7 +182,8 @@ static inline void page_dup_rmap(struct page *page)
> > * Called from mm/vmscan.c to handle paging out
> > */
> > int page_referenced(struct page *, int is_locked,
> > - struct mem_cgroup *memcg, unsigned long *vm_flags);
> > + struct mem_cgroup *memcg, unsigned long *vm_flags,
> > + int *is_dirty);
> >
> > #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
> >
> > @@ -260,9 +262,12 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
> >
> > static inline int page_referenced(struct page *page, int is_locked,
> > struct mem_cgroup *memcg,
> > - unsigned long *vm_flags)
> > + unsigned long *vm_flags,
> > + int *is_pte_dirty)
> > {
> > *vm_flags = 0;
> > + if (is_pte_dirty)
> > + *is_pte_dirty = 0;
> > return 0;
> > }
> >
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index ced92345c963..e2d3fb1e9814 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > FOR_ALL_ZONES(PGALLOC),
> > PGFREE, PGACTIVATE, PGDEACTIVATE,
> > PGFAULT, PGMAJFAULT,
> > + PGLAZYFREED,
> > FOR_ALL_ZONES(PGREFILL),
> > FOR_ALL_ZONES(PGSTEAL_KSWAPD),
> > FOR_ALL_ZONES(PGSTEAL_DIRECT),
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index ddc3b36f1046..7a94102b7a02 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -34,6 +34,7 @@
> > #define MADV_SEQUENTIAL 2 /* expect sequential page references */
> > #define MADV_WILLNEED 3 /* will need these pages */
> > #define MADV_DONTNEED 4 /* don't need these pages */
> > +#define MADV_FREE 5 /* free pages only if memory pressure */
> >
> > /* common parameters: try to keep these consistent across architectures */
> > #define MADV_REMOVE 9 /* remove these pages & resources */
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 0938b30da4ab..55b42e5e32a3 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -19,6 +19,14 @@
> > #include <linux/blkdev.h>
> > #include <linux/swap.h>
> > #include <linux/swapops.h>
> > +#include <linux/mmu_notifier.h>
> > +
> > +#include <asm/tlb.h>
> > +
> > +struct madvise_free_private {
> > + struct vm_area_struct *vma;
> > + struct mmu_gather *tlb;
> > +};
> >
> > /*
> > * Any behaviour which results in changes to the vma->vm_flags needs to
> > @@ -31,6 +39,7 @@ static int madvise_need_mmap_write(int behavior)
> > case MADV_REMOVE:
> > case MADV_WILLNEED:
> > case MADV_DONTNEED:
> > + case MADV_FREE:
> > return 0;
> > default:
> > /* be safe, default to 1. list exceptions explicitly */
> > @@ -251,6 +260,124 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > return 0;
> > }
> >
> > +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > + unsigned long end, struct mm_walk *walk)
> > +
> > +{
> > + struct madvise_free_private *fp = walk->private;
> > + struct mmu_gather *tlb = fp->tlb;
> > + struct mm_struct *mm = tlb->mm;
> > + struct vm_area_struct *vma = fp->vma;
> > + spinlock_t *ptl;
> > + pte_t *pte, ptent;
> > + struct page *page;
> > +
> > + split_huge_page_pmd(vma, addr, pmd);
> > + if (pmd_trans_unstable(pmd))
> > + return 0;
> > +
> > + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > + arch_enter_lazy_mmu_mode();
> > + for (; addr != end; pte++, addr += PAGE_SIZE) {
> > + ptent = *pte;
> > +
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + page = vm_normal_page(vma, addr, ptent);
> > + if (!page)
> > + continue;
> > +
> > + if (PageSwapCache(page)) {
> > + if (trylock_page(page)) {
> > + if (try_to_free_swap(page))
> > + ClearPageDirty(page);
> > + unlock_page(page);
>
> 'continue' for !try_to_free_swap(page) case?

Good catch.

>
> > + } else
> > + continue;
> > + }
> > +
> > + /*
> > + * Some of architecture(ex, PPC) don't update TLB
> > + * with set_pte_at and tlb_remove_tlb_entry so for
> > + * the portability, remap the pte with old|clean
> > + * after pte clearing.
> > + */
> > + ptent = ptep_get_and_clear_full(mm, addr, pte,
> > + tlb->fullmm);
> > + ptent = pte_mkold(ptent);
> > + ptent = pte_mkclean(ptent);
> > + set_pte_at(mm, addr, pte, ptent);
> > + tlb_remove_tlb_entry(tlb, pte, addr);
> > + }
> > + arch_leave_lazy_mmu_mode();
> > + pte_unmap_unlock(pte - 1, ptl);
> > + cond_resched();
> > + return 0;
> > +}
> > +
> > +static void madvise_free_page_range(struct mmu_gather *tlb,
> > + struct vm_area_struct *vma,
> > + unsigned long addr, unsigned long end)
> > +{
> > + struct madvise_free_private fp = {
> > + .vma = vma,
> > + .tlb = tlb,
> > + };
> > +
> > + struct mm_walk free_walk = {
> > + .pmd_entry = madvise_free_pte_range,
> > + .mm = vma->vm_mm,
> > + .private = &fp,
> > + };
> > +
> > + BUG_ON(addr >= end);
> > + tlb_start_vma(tlb, vma);
> > + walk_page_range(addr, end, &free_walk);
> > + tlb_end_vma(tlb, vma);
> > +}
> > +
> > +static int madvise_free_single_vma(struct vm_area_struct *vma,
> > + unsigned long start_addr, unsigned long end_addr)
> > +{
> > + unsigned long start, end;
> > + struct mm_struct *mm = vma->vm_mm;
> > + struct mmu_gather tlb;
> > +
> > + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
>
> THP uses VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE to filter out
> 'special' VMAs. Looks reasonable here too.
>
> Otherwise:
>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>

Thanks, Kirill!

>
> --
> Kirill A. Shutemov
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/