Re: [PATCH 30 of 66] transparent hugepage core

From: Mel Gorman
Date: Thu Nov 18 2010 - 10:12:43 EST


On Wed, Nov 03, 2010 at 04:28:05PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>
> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs. Some of the restrictions
> I'd like to see removed:
>
> 1) hugepages have to be swappable or the guest physical memory remains
> locked in RAM and can't be paged out to swap
>
> 2) if a hugepage allocation fails, regular pages should be allocated
> instead and mixed in the same vma without any failure and without
> userland noticing
>
> 3) if some task quits and more hugepages become available in the
> buddy, guest physical memory backed by regular pages should be
> relocated on hugepages automatically in regions under
> madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
> kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
> not null)
>
> 4) avoidance of reservation and maximization of use of hugepages whenever
> possible. Reservation (needed to avoid runtime fatal failures) may be ok for
> 1 machine with 1 database with 1 database cache with 1 database cache size
> known at boot time. It's definitely not feasible with a virtualization
> hypervisor usage like RHEV-H that runs an unknown number of virtual machines
> with an unknown size of each virtual machine with an unknown amount of
> pagecache that could be potentially useful in the host for guest not using
> O_DIRECT (aka cache=off).
>
> hugepages in the virtualization hypervisor (and also in the guest!) are
> much more important than in a regular host not using virtualization, because
> with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
> case only the hypervisor uses transparent hugepages, and they decrease the
> tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
> the linux guest use this patch (though the guest will limit the additional
> speedup to anonymous regions only for now...). Even more important is that the
> tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
> paging or no-virtualization scenario. So maximizing the amount of virtual
> memory cached by the TLB pays off significantly more with NPT/EPT than without
> (even if there would be no significant speedup in the tlb-miss runtime).
>
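(As an aside, those numbers line up with the usual two-dimensional walk
arithmetic, assuming 4-level paging on both sides: a worst-case nested TLB
miss touches (4+1)*(4+1)-1 = 24 page-table entries, a 2M mapping on the host
side alone shaves one level off each nested walk for (4+1)*(3+1)-1 = 19, and
a 2M mapping in the guest as well gives (3+1)*(3+1)-1 = 15. I'm assuming
that's where the 24/19/15 figures come from.)
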
> The first (and more tedious) part of this work requires allowing the VM to
> handle anonymous hugepages mixed with regular pages transparently on regular
> anonymous vmas. This is what this patch tries to achieve in the least intrusive
> possible way. We want hugepages and hugetlb to be used in a way so that all
> applications can benefit without changes (as usual we leverage the KVM
> virtualization design: by improving the Linux VM at large, KVM gets the
> performance boost too).
>
> The most important design choice is: always fallback to 4k allocation
> if the hugepage allocation fails! This is the _very_ opposite of some
> large pagecache patches that failed with -EIO back then if a 64k (or
> similar) allocation failed...
>
> Second important decision (to reduce the impact of the feature on the
> existing pagetable handling code) is that at any time we can split a
> hugepage into 512 regular pages and it has to be done with an
> operation that can't fail. This way the reliability of the swapping
> isn't decreased (no need to allocate memory when we are short on
> memory to swap) and it's trivial to plug a split_huge_page* one-liner
> where needed without polluting the VM. Over time we can teach
> mprotect, mremap and friends to handle pmd_trans_huge natively without
> calling split_huge_page*. The fact it can't fail isn't just for swap:
> if split_huge_page would return -ENOMEM (instead of the current void)
> we'd need to rollback the mprotect from the middle of it (ideally
> including undoing the split_vma) which would be a big change and in
> the very wrong direction (it'd likely be simpler not to call
> split_huge_page at all and to teach mprotect and friends to handle
> hugepages instead of rolling them back from the middle). In short the
> very value of split_huge_page is that it can't fail.
>
> The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
> and incremental and it'll just be a "harmless" addition later if this
> initial part is agreed upon. It also should be noted that locking-wise
> replacing regular pages with hugepages is going to be very easy if
> compared to what I'm doing below in split_huge_page, as it will only
> happen when page_count(page) matches page_mapcount(page) if we can
> take the PG_lock and mmap_sem in write mode. collapse_huge_page will
> be a "best effort" that (unlike split_huge_page) can fail at the
> minimal sign of trouble and we can try again later. collapse_huge_page
> will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
> work similarly to madvise(MADV_MERGEABLE).
>
> The default I like is that transparent hugepages are used at page fault time.
> This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
> control knob can be set to three values "always", "madvise", "never" which
> mean respectively that hugepages are always used, or only inside
> madvise(MADV_HUGEPAGE) regions, or never used.
> /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
> allocation should defrag memory aggressively "always", only inside "madvise"
> regions, or "never".
>
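For anyone trying this out, exercising those knobs should just be a matter of
(assuming the sysfs names above survive review as-is):

	cat /sys/kernel/mm/transparent_hugepage/enabled
	[always] madvise never
	echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
	echo madvise > /sys/kernel/mm/transparent_hugepage/defrag

with the bracketed value in the cat output marking the current setting, going
by double_flag_show() further down.
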
> The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
> put_page (from get_user_page users that can't use mmu notifier like
> O_DIRECT) that runs against a __split_huge_page_refcount instead was a
> pain to serialize in a way that would result always in a coherent page
> count for both tail and head. I think my locking solution with a
> compound_lock taken only after the page_first is valid and is still a
> PageHead should be safe but it surely needs review from SMP race point
> of view. In short there is no current existing way to serialize the
> O_DIRECT final put_page against split_huge_page_refcount so I had to
> invent a new one (O_DIRECT loses knowledge on the mapping status by
> the time gup_fast returns so...). And I didn't want to impact all
> gup/gup_fast users for now, maybe if we change the gup interface
> substantially we can avoid this locking, I admit I didn't think too
> much about it because changing the gup unpinning interface would be
> invasive.
>
> If we ignored O_DIRECT we could stick to the existing compound
> refcounting code, by simply adding a
> get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
> notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
> set we'd just BUG_ON if nobody registered itself in the current task
> mmu notifier list yet). But O_DIRECT is fundamental for decent
> performance of virtualized I/O on fast storage, so we can't avoid it and
> we have to solve the race of put_page against split_huge_page_refcount to
> achieve a complete hugepage feature for KVM.
>
> Swap and oom work fine (well, just like with regular pages ;). MMU
> notifier is handled transparently too, with the exception of the young
> bit on the pmd, that didn't have a range check but I think KVM will be
> fine because the whole point of hugepages is that EPT/NPT will also
> use a huge pmd when they notice gup returns pages with PageCompound set,
> so they won't care about a range and there's just the pmd young bit to
> check in that case.
>
> NOTE: in some cases if the L2 cache is small, this may slow down and
> waste memory during COWs because 4M of memory are accessed in a single
> fault instead of 8k (the payoff is that after COW the program can run
> faster). So we might want to switch the copy_huge_page (and
> clear_huge_page too) to non-temporal stores. I also extensively
> researched ways to avoid this cache thrashing with a full prefault
> logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
> patches that fully implemented prefault) but I concluded they're not
> worth it: they add huge additional complexity and they remove all tlb
> benefits until the full hugepage has been faulted in, to save a little bit of
> memory and some cache during app startup, but they still don't improve
> substantially the cache-thrashing during startup if the prefault happens in >4k
> chunks. One reason is that those 4k pte entries copied are still mapped on a
> perfectly cache-colored hugepage, so the thrashing is the worst one can generate
> in those copies (cows of 4k pages aren't so well colored so they thrash
> less, but again this results in software running faster after the page fault).
> Those prefault patches allowed things like a pte where post-cow pages were
> local 4k regular anon pages and the not-yet-cowed pte entries were pointing in
> the middle of some hugepage mapped read-only. If it doesn't pay off
> substantially with today's hardware it will pay off even less in the future with
> larger l2 caches, and the prefault logic would bloat the VM a lot. If one is
> embedded, transparent_hugepage can be disabled during boot with sysfs or with
> the boot commandline parameter transparent_hugepage=0 (or
> transparent_hugepage=2 to restrict hugepages inside madvise regions) that will
> ensure not a single hugepage is allocated at boot time. It is simple enough to
> just disable transparent hugepage globally and let transparent hugepages be
> allocated selectively by applications in the MADV_HUGEPAGE region (both at page
> fault time, and if enabled with the collapse_huge_page too through the kernel
> daemon).
>
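From the application side, opting in is then just a single madvise() call. A
minimal sketch, written against the MADV_HUGEPAGE interface from the later
patches in the series rather than anything in this one, and allocating
2M-aligned so whole pmd-sized chunks of the region fall inside the vma
(which is what do_huge_pmd_anonymous_page() checks for below):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	#define LEN (256UL << 20)

	int main(void)
	{
		void *p;

		/* 2M-aligned anonymous memory so huge pmds can back it */
		if (posix_memalign(&p, 2UL << 20, LEN))
			return 1;
		/* hint that this range should use transparent hugepages */
		if (madvise(p, LEN, MADV_HUGEPAGE))
			perror("madvise");
		/* faults here can now be satisfied with huge pmds */
		memset(p, 0, LEN);
		return 0;
	}
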
> This patch supports only hugepages mapped in the pmd, archs that have
> smaller hugepages will not fit in this patch alone. Also some archs like power
> have certain tlb limits that prevent mixing different page sizes in the same
> regions so they will not fit in this framework that requires "graceful
> fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
> hugetlbfs remains a perfect fit for those because its software limits happen to
> match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
> sizes like 1GByte that cannot be expected to be found unfragmented after a
> certain system uptime and that would be very expensive to defragment with
> relocation, so requiring reservation. hugetlbfs is the "reservation way", the
> point of transparent hugepages is not to have any reservation at all and
> maximizing the use of cache and hugepages at all times automatically.
>
> Some performance results:
>
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566023
> memset tlb miss 453854
> memset second tlb miss 453321
> random access tlb miss 41635
> random access second tlb miss 41658
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566471
> memset tlb miss 453375
> memset second tlb miss 453320
> random access tlb miss 41636
> random access second tlb miss 41637
> vmx andrea # ./largepages3
> memset page fault 1566642
> memset tlb miss 453417
> memset second tlb miss 453313
> random access tlb miss 41630
> random access second tlb miss 41647
> vmx andrea # ./largepages3
> memset page fault 1566872
> memset tlb miss 453418
> memset second tlb miss 453315
> random access tlb miss 41618
> random access second tlb miss 41659
> vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
> vmx andrea # ./largepages3
> memset page fault 2182476
> memset tlb miss 460305
> memset second tlb miss 460179
> random access tlb miss 44483
> random access second tlb miss 44186
> vmx andrea # ./largepages3
> memset page fault 2182791
> memset tlb miss 460742
> memset second tlb miss 459962
> random access tlb miss 43981
> random access second tlb miss 43988
>
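Eyeballing those numbers: roughly a 28% saving on the fault+memset pass
(~1.57s vs ~2.18s for the 3G region), and a smaller but consistent win on
the pure TLB-miss passes, around 1.5% on the sequential memset and 6% on the
4k-stride walk below.
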
> ============
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
>
> #define SIZE (3UL*1024*1024*1024)
>
> int main()
> {
> 	char *p = malloc(SIZE), *p2;
> 	struct timeval before, after;
>
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset page fault %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
>
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
>
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
>
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
>
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
>
> 	return 0;
> }
> ============
>

All of that seems fine to me. There are nits in parts, but they're simply not
worth calling out. In principle, I Agree With This :)

> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Acked-by: Rik van Riel <riel@xxxxxxxxxx>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
> * * *
> adapt to mm_counter in -mm
>
> From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>
> The interface changed slightly.
>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Acked-by: Rik van Riel <riel@xxxxxxxxxx>
> ---
> * * *
> transparent hugepage bootparam
>
> From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>
> Allow transparent_hugepage=always|never|madvise at boot.
>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---
>
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -286,6 +286,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pm
> return pmd_set_flags(pmd, _PAGE_RW);
> }
>
> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
> +{
> + return pmd_clear_flags(pmd, _PAGE_PRESENT);
> +}
> +
> #endif /* !__ASSEMBLY__ */
>
> #endif /* _ASM_X86_PGTABLE_64_H */
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -108,6 +108,9 @@ struct vm_area_struct;
> __GFP_HARDWALL | __GFP_HIGHMEM | \
> __GFP_MOVABLE)
> #define GFP_IOFS (__GFP_IO | __GFP_FS)
> +#define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> + __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> + __GFP_NO_KSWAPD)
>
> #ifdef CONFIG_NUMA
> #define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> new file mode 100644
> --- /dev/null
> +++ b/include/linux/huge_mm.h
> @@ -0,0 +1,126 @@
> +#ifndef _LINUX_HUGE_MM_H
> +#define _LINUX_HUGE_MM_H
> +
> +extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd,
> + unsigned int flags);
> +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> + struct vm_area_struct *vma);
> +extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd,
> + pmd_t orig_pmd);
> +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
> +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> + unsigned long addr,
> + pmd_t *pmd,
> + unsigned int flags);
> +extern int zap_huge_pmd(struct mmu_gather *tlb,
> + struct vm_area_struct *vma,
> + pmd_t *pmd);
> +
> +enum transparent_hugepage_flag {
> + TRANSPARENT_HUGEPAGE_FLAG,
> + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> +#ifdef CONFIG_DEBUG_VM
> + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> +#endif
> +};
> +
> +enum page_check_address_pmd_flag {
> + PAGE_CHECK_ADDRESS_PMD_FLAG,
> + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
> + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
> +};
> +extern pmd_t *page_check_address_pmd(struct page *page,
> + struct mm_struct *mm,
> + unsigned long address,
> + enum page_check_address_pmd_flag flag);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define HPAGE_PMD_SHIFT HPAGE_SHIFT
> +#define HPAGE_PMD_MASK HPAGE_MASK
> +#define HPAGE_PMD_SIZE HPAGE_SIZE
> +
> +#define transparent_hugepage_enabled(__vma) \
> + (transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) || \
> + (transparent_hugepage_flags & \
> + (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
> + (__vma)->vm_flags & VM_HUGEPAGE))
> +#define transparent_hugepage_defrag(__vma) \
> + ((transparent_hugepage_flags & \
> + (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
> + (transparent_hugepage_flags & \
> + (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
> + (__vma)->vm_flags & VM_HUGEPAGE))
> +#ifdef CONFIG_DEBUG_VM
> +#define transparent_hugepage_debug_cow() \
> + (transparent_hugepage_flags & \
> + (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
> +#else /* CONFIG_DEBUG_VM */
> +#define transparent_hugepage_debug_cow() 0
> +#endif /* CONFIG_DEBUG_VM */
> +
> +extern unsigned long transparent_hugepage_flags;
> +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> + pmd_t *dst_pmd, pmd_t *src_pmd,
> + struct vm_area_struct *vma,
> + unsigned long addr, unsigned long end);
> +extern int handle_pte_fault(struct mm_struct *mm,
> + struct vm_area_struct *vma, unsigned long address,
> + pte_t *pte, pmd_t *pmd, unsigned int flags);
> +extern int split_huge_page(struct page *page);
> +extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
> +#define split_huge_page_pmd(__mm, __pmd) \
> + do { \
> + pmd_t *____pmd = (__pmd); \
> + if (unlikely(pmd_trans_huge(*____pmd))) \
> + __split_huge_page_pmd(__mm, ____pmd); \
> + } while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd) \
> + do { \
> + pmd_t *____pmd = (__pmd); \
> + spin_unlock_wait(&(__anon_vma)->root->lock); \
> + /* \
> + * spin_unlock_wait() is just a loop in C and so the \
> + * CPU can reorder anything around it. \
> + */ \
> + smp_mb(); \

Just a note, as I see nothing wrong with this, and it's a good spot: the
unlock isn't a memory barrier. Out of curiosity, does it really need to be
a full barrier or would a write barrier have been enough?

> + BUG_ON(pmd_trans_splitting(*____pmd) || \
> + pmd_trans_huge(*____pmd)); \
> + } while (0)
> +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> +#if HPAGE_PMD_ORDER > MAX_ORDER
> +#error "hugepages can't be allocated by the buddy allocator"
> +#endif
> +
> +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> +static inline int PageTransHuge(struct page *page)
> +{
> + VM_BUG_ON(PageTail(page));
> + return PageHead(page);
> +}

huge_mm.h seems an odd place for these. Should the flags go in page-flags.h
and maybe put vma_address() in internal.h?

Not a biggie.

> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
> +#define HPAGE_PMD_MASK ({ BUG(); 0; })
> +#define HPAGE_PMD_SIZE ({ BUG(); 0; })
> +
> +#define transparent_hugepage_enabled(__vma) 0
> +
> +#define transparent_hugepage_flags 0UL
> +static inline int split_huge_page(struct page *page)
> +{
> + return 0;
> +}
> +#define split_huge_page_pmd(__mm, __pmd) \
> + do { } while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd) \
> + do { } while (0)
> +#define PageTransHuge(page) 0
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +#endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -111,6 +111,9 @@ extern unsigned int kobjsize(const void
> #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */
> #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */
> #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
> +#if BITS_PER_LONG > 32
> +#define VM_HUGEPAGE 0x100000000UL /* MADV_HUGEPAGE marked this vma */
> +#endif
>
> #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
> #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
> @@ -240,6 +243,7 @@ struct inode;
> * files which need it (119 of them)
> */
> #include <linux/page-flags.h>
> +#include <linux/huge_mm.h>
>
> /*
> * Methods to modify the page usage count.
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
> }
>
> static inline void
> +__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> + struct list_head *head)
> +{
> + list_add(&page->lru, head);
> + __inc_zone_state(zone, NR_LRU_BASE + l);
> + mem_cgroup_add_lru_list(page, l);
> +}
> +
> +static inline void
> add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> {
> - list_add(&page->lru, &zone->lru[l].list);
> - __inc_zone_state(zone, NR_LRU_BASE + l);
> - mem_cgroup_add_lru_list(page, l);
> + __add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
> }
>

Do these really need to be in a public header or can they move to
mm/swap.c?

> static inline void
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -208,6 +208,8 @@ extern unsigned int nr_free_pagecache_pa
> /* linux/mm/swap.c */
> extern void __lru_cache_add(struct page *, enum lru_list lru);
> extern void lru_cache_add_lru(struct page *, enum lru_list lru);
> +extern void lru_add_page_tail(struct zone* zone,
> + struct page *page, struct page *page_tail);
> extern void activate_page(struct page *);
> extern void mark_page_accessed(struct page *);
> extern void lru_add_drain(void);
> diff --git a/mm/Makefile b/mm/Makefile
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -42,3 +42,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
> obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
> obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/huge_memory.c
> @@ -0,0 +1,899 @@
> +/*
> + * Copyright (C) 2009 Red Hat, Inc.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/highmem.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/rmap.h>
> +#include <linux/swap.h>
> +#include <asm/tlb.h>
> +#include <asm/pgalloc.h>
> +#include "internal.h"
> +
> +unsigned long transparent_hugepage_flags __read_mostly =
> + (1<<TRANSPARENT_HUGEPAGE_FLAG);
> +
> +#ifdef CONFIG_SYSFS
> +static ssize_t double_flag_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf,
> + enum transparent_hugepage_flag enabled,
> + enum transparent_hugepage_flag req_madv)
> +{
> + if (test_bit(enabled, &transparent_hugepage_flags)) {
> + VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
> + return sprintf(buf, "[always] madvise never\n");
> + } else if (test_bit(req_madv, &transparent_hugepage_flags))
> + return sprintf(buf, "always [madvise] never\n");
> + else
> + return sprintf(buf, "always madvise [never]\n");
> +}
> +static ssize_t double_flag_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count,
> + enum transparent_hugepage_flag enabled,
> + enum transparent_hugepage_flag req_madv)
> +{
> + if (!memcmp("always", buf,
> + min(sizeof("always")-1, count))) {
> + set_bit(enabled, &transparent_hugepage_flags);
> + clear_bit(req_madv, &transparent_hugepage_flags);
> + } else if (!memcmp("madvise", buf,
> + min(sizeof("madvise")-1, count))) {
> + clear_bit(enabled, &transparent_hugepage_flags);
> + set_bit(req_madv, &transparent_hugepage_flags);
> + } else if (!memcmp("never", buf,
> + min(sizeof("never")-1, count))) {
> + clear_bit(enabled, &transparent_hugepage_flags);
> + clear_bit(req_madv, &transparent_hugepage_flags);
> + } else
> + return -EINVAL;
> +
> + return count;
> +}
> +
> +static ssize_t enabled_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return double_flag_show(kobj, attr, buf,
> + TRANSPARENT_HUGEPAGE_FLAG,
> + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static ssize_t enabled_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return double_flag_store(kobj, attr, buf, count,
> + TRANSPARENT_HUGEPAGE_FLAG,
> + TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute enabled_attr =
> + __ATTR(enabled, 0644, enabled_show, enabled_store);
> +
> +static ssize_t single_flag_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf,
> + enum transparent_hugepage_flag flag)
> +{
> + if (test_bit(flag, &transparent_hugepage_flags))
> + return sprintf(buf, "[yes] no\n");
> + else
> + return sprintf(buf, "yes [no]\n");
> +}
> +static ssize_t single_flag_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count,
> + enum transparent_hugepage_flag flag)
> +{
> + if (!memcmp("yes", buf,
> + min(sizeof("yes")-1, count))) {
> + set_bit(flag, &transparent_hugepage_flags);
> + } else if (!memcmp("no", buf,
> + min(sizeof("no")-1, count))) {
> + clear_bit(flag, &transparent_hugepage_flags);
> + } else
> + return -EINVAL;
> +
> + return count;
> +}
> +
> +/*
> + * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
> + * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
> + * memory just to allocate one more hugepage.
> + */
> +static ssize_t defrag_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return double_flag_show(kobj, attr, buf,
> + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static ssize_t defrag_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return double_flag_store(kobj, attr, buf, count,
> + TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> + TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute defrag_attr =
> + __ATTR(defrag, 0644, defrag_show, defrag_store);
> +
> +#ifdef CONFIG_DEBUG_VM
> +static ssize_t debug_cow_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return single_flag_show(kobj, attr, buf,
> + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static ssize_t debug_cow_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return single_flag_store(kobj, attr, buf, count,
> + TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static struct kobj_attribute debug_cow_attr =
> + __ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
> +#endif /* CONFIG_DEBUG_VM */
> +
> +static struct attribute *hugepage_attr[] = {
> + &enabled_attr.attr,
> + &defrag_attr.attr,
> +#ifdef CONFIG_DEBUG_VM
> + &debug_cow_attr.attr,
> +#endif
> + NULL,
> +};
> +
> +static struct attribute_group hugepage_attr_group = {
> + .attrs = hugepage_attr,
> + .name = "transparent_hugepage",
> +};
> +#endif /* CONFIG_SYSFS */
> +
> +static int __init hugepage_init(void)
> +{
> +#ifdef CONFIG_SYSFS
> + int err;
> +
> + err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
> + if (err)
> + printk(KERN_ERR "hugepage: register sysfs failed\n");
> +#endif
> + return 0;
> +}
> +module_init(hugepage_init)
> +
> +static int __init setup_transparent_hugepage(char *str)
> +{
> + int ret = 0;
> + if (!str)
> + goto out;
> + if (!strcmp(str, "always")) {
> + set_bit(TRANSPARENT_HUGEPAGE_FLAG,
> + &transparent_hugepage_flags);
> + clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> + &transparent_hugepage_flags);
> + ret = 1;
> + } else if (!strcmp(str, "madvise")) {
> + clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> + &transparent_hugepage_flags);
> + set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> + &transparent_hugepage_flags);
> + ret = 1;
> + } else if (!strcmp(str, "never")) {
> + clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> + &transparent_hugepage_flags);
> + clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> + &transparent_hugepage_flags);
> + ret = 1;
> + }
> +out:
> + if (!ret)
> + printk(KERN_WARNING
> + "transparent_hugepage= cannot parse, ignored\n");
> + return ret;
> +}
> +__setup("transparent_hugepage=", setup_transparent_hugepage);
> +
> +static void prepare_pmd_huge_pte(pgtable_t pgtable,
> + struct mm_struct *mm)
> +{
> + VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +

assert_spin_locked() ?
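i.e. presumably just

	assert_spin_locked(&mm->page_table_lock);

which is the stock helper for exactly this sort of "caller must hold the
lock" check.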

> + /* FIFO */
> + if (!mm->pmd_huge_pte)
> + INIT_LIST_HEAD(&pgtable->lru);
> + else
> + list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
> + mm->pmd_huge_pte = pgtable;
> +}
> +
> +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> +{
> + if (likely(vma->vm_flags & VM_WRITE))
> + pmd = pmd_mkwrite(pmd);
> + return pmd;
> +}
> +
> +static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long haddr, pmd_t *pmd,
> + struct page *page)
> +{
> + int ret = 0;
> + pgtable_t pgtable;
> +
> + VM_BUG_ON(!PageCompound(page));
> + pgtable = pte_alloc_one(mm, haddr);
> + if (unlikely(!pgtable)) {
> + put_page(page);
> + return VM_FAULT_OOM;
> + }
> +
> + clear_huge_page(page, haddr, HPAGE_PMD_NR);
> + __SetPageUptodate(page);
> +
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_none(*pmd))) {
> + spin_unlock(&mm->page_table_lock);
> + put_page(page);
> + pte_free(mm, pgtable);
> + } else {
> + pmd_t entry;
> + entry = mk_pmd(page, vma->vm_page_prot);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + entry = pmd_mkhuge(entry);
> + /*
> + * The spinlocking to take the lru_lock inside
> + * page_add_new_anon_rmap() acts as a full memory
> + * barrier to be sure clear_huge_page writes become
> + * visible after the set_pmd_at() write.
> + */
> + page_add_new_anon_rmap(page, vma, haddr);
> + set_pmd_at(mm, haddr, pmd, entry);
> + prepare_pmd_huge_pte(pgtable, mm);
> + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> + spin_unlock(&mm->page_table_lock);
> + }
> +
> + return ret;
> +}
> +
> +static inline struct page *alloc_hugepage(int defrag)
> +{
> + return alloc_pages(GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT),
> + HPAGE_PMD_ORDER);
> +}
> +
> +int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd,
> + unsigned int flags)
> +{
> + struct page *page;
> + unsigned long haddr = address & HPAGE_PMD_MASK;
> + pte_t *pte;
> +
> + if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
> + if (unlikely(anon_vma_prepare(vma)))
> + return VM_FAULT_OOM;
> + page = alloc_hugepage(transparent_hugepage_defrag(vma));
> + if (unlikely(!page))
> + goto out;
> +
> + return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
> + }
> +out:
> + /*
> + * Use __pte_alloc instead of pte_alloc_map, because we can't
> + * run pte_offset_map on the pmd, if an huge pmd could
> + * materialize from under us from a different thread.
> + */
> + if (unlikely(__pte_alloc(mm, vma, pmd, address)))
> + return VM_FAULT_OOM;
> + /* if an huge pmd materialized from under us just retry later */
> + if (unlikely(pmd_trans_huge(*pmd)))
> + return 0;
> + /*
> + * A regular pmd is established and it can't morph into a huge pmd
> + * from under us anymore at this point because we hold the mmap_sem
> + * read mode and khugepaged takes it in write mode. So now it's
> + * safe to run pte_offset_map().
> + */
> + pte = pte_offset_map(pmd, address);
> + return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> +}
> +
> +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> + pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> + struct vm_area_struct *vma)
> +{
> + struct page *src_page;
> + pmd_t pmd;
> + pgtable_t pgtable;
> + int ret;
> +
> + ret = -ENOMEM;
> + pgtable = pte_alloc_one(dst_mm, addr);
> + if (unlikely(!pgtable))
> + goto out;
> +
> + spin_lock(&dst_mm->page_table_lock);
> + spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
> +
> + ret = -EAGAIN;
> + pmd = *src_pmd;
> + if (unlikely(!pmd_trans_huge(pmd))) {
> + pte_free(dst_mm, pgtable);
> + goto out_unlock;
> + }
> + if (unlikely(pmd_trans_splitting(pmd))) {
> + /* split huge page running from under us */
> + spin_unlock(&src_mm->page_table_lock);
> + spin_unlock(&dst_mm->page_table_lock);
> + pte_free(dst_mm, pgtable);
> +
> + wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
> + goto out;
> + }
> + src_page = pmd_page(pmd);
> + VM_BUG_ON(!PageHead(src_page));
> + get_page(src_page);
> + page_dup_rmap(src_page);
> + add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +
> + pmdp_set_wrprotect(src_mm, addr, src_pmd);
> + pmd = pmd_mkold(pmd_wrprotect(pmd));
> + set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> + prepare_pmd_huge_pte(pgtable, dst_mm);
> +
> + ret = 0;
> +out_unlock:
> + spin_unlock(&src_mm->page_table_lock);
> + spin_unlock(&dst_mm->page_table_lock);
> +out:
> + return ret;
> +}
> +
> +/* no "address" argument so destroys page coloring of some arch */
> +pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
> +{
> + pgtable_t pgtable;
> +
> + VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> + /* FIFO */
> + pgtable = mm->pmd_huge_pte;
> + if (list_empty(&pgtable->lru))
> + mm->pmd_huge_pte = NULL;
> + else {
> + mm->pmd_huge_pte = list_entry(pgtable->lru.next,
> + struct page, lru);
> + list_del(&pgtable->lru);
> + }
> + return pgtable;
> +}
> +
> +static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long address,
> + pmd_t *pmd, pmd_t orig_pmd,
> + struct page *page,
> + unsigned long haddr)
> +{
> + pgtable_t pgtable;
> + pmd_t _pmd;
> + int ret = 0, i;
> + struct page **pages;
> +
> + pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
> + GFP_KERNEL);
> + if (unlikely(!pages)) {
> + ret |= VM_FAULT_OOM;
> + goto out;
> + }
> +
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
> + pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> + vma, address);
> + if (unlikely(!pages[i])) {
> + while (--i >= 0)
> + put_page(pages[i]);
> + kfree(pages);
> + ret |= VM_FAULT_OOM;
> + goto out;
> + }
> + }
> +
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
> + copy_user_highpage(pages[i], page + i,
> + haddr + PAGE_SHIFT*i, vma);
> + __SetPageUptodate(pages[i]);
> + cond_resched();
> + }
> +
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_same(*pmd, orig_pmd)))
> + goto out_free_pages;
> + VM_BUG_ON(!PageHead(page));
> +
> + pmdp_clear_flush_notify(vma, haddr, pmd);
> + /* leave pmd empty until pte is filled */
> +
> + pgtable = get_pmd_huge_pte(mm);
> + pmd_populate(mm, &_pmd, pgtable);
> +
> + for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> + pte_t *pte, entry;
> + entry = mk_pte(pages[i], vma->vm_page_prot);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + page_add_new_anon_rmap(pages[i], vma, haddr);
> + pte = pte_offset_map(&_pmd, haddr);
> + VM_BUG_ON(!pte_none(*pte));
> + set_pte_at(mm, haddr, pte, entry);
> + pte_unmap(pte);
> + }
> + kfree(pages);
> +
> + mm->nr_ptes++;
> + smp_wmb(); /* make pte visible before pmd */
> + pmd_populate(mm, pmd, pgtable);
> + page_remove_rmap(page);
> + spin_unlock(&mm->page_table_lock);
> +
> + ret |= VM_FAULT_WRITE;
> + put_page(page);
> +
> +out:
> + return ret;
> +
> +out_free_pages:
> + spin_unlock(&mm->page_table_lock);
> + for (i = 0; i < HPAGE_PMD_NR; i++)
> + put_page(pages[i]);
> + kfree(pages);
> + goto out;
> +}
> +
> +int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
> +{
> + int ret = 0;
> + struct page *page, *new_page;
> + unsigned long haddr;
> +
> + VM_BUG_ON(!vma->anon_vma);
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_same(*pmd, orig_pmd)))
> + goto out_unlock;
> +
> + page = pmd_page(orig_pmd);
> + VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> + haddr = address & HPAGE_PMD_MASK;
> + if (page_mapcount(page) == 1) {
> + pmd_t entry;
> + entry = pmd_mkyoung(orig_pmd);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
> + update_mmu_cache(vma, address, entry);
> + ret |= VM_FAULT_WRITE;
> + goto out_unlock;
> + }
> + get_page(page);
> + spin_unlock(&mm->page_table_lock);
> +
> + if (transparent_hugepage_enabled(vma) &&
> + !transparent_hugepage_debug_cow())
> + new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
> + else
> + new_page = NULL;
> +
> + if (unlikely(!new_page)) {
> + ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
> + pmd, orig_pmd, page, haddr);
> + put_page(page);
> + goto out;
> + }
> +
> + copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> + __SetPageUptodate(new_page);
> +
> + spin_lock(&mm->page_table_lock);
> + put_page(page);
> + if (unlikely(!pmd_same(*pmd, orig_pmd)))
> + put_page(new_page);
> + else {
> + pmd_t entry;
> + VM_BUG_ON(!PageHead(page));
> + entry = mk_pmd(new_page, vma->vm_page_prot);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + entry = pmd_mkhuge(entry);
> + pmdp_clear_flush_notify(vma, haddr, pmd);
> + page_add_new_anon_rmap(new_page, vma, haddr);
> + set_pmd_at(mm, haddr, pmd, entry);
> + update_mmu_cache(vma, address, entry);
> + page_remove_rmap(page);
> + put_page(page);
> + ret |= VM_FAULT_WRITE;
> + }
> +out_unlock:
> + spin_unlock(&mm->page_table_lock);
> +out:
> + return ret;
> +}
> +
> +struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> + unsigned long addr,
> + pmd_t *pmd,
> + unsigned int flags)
> +{
> + struct page *page = NULL;
> +
> + VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> + if (flags & FOLL_WRITE && !pmd_write(*pmd))
> + goto out;
> +
> + page = pmd_page(*pmd);
> + VM_BUG_ON(!PageHead(page));
> + if (flags & FOLL_TOUCH) {
> + pmd_t _pmd;
> + /*
> + * We should set the dirty bit only for FOLL_WRITE but
> + * for now the dirty bit in the pmd is meaningless.
> + * And if the dirty bit will become meaningful and
> + * we'll only set it with FOLL_WRITE, an atomic
> + * set_bit will be required on the pmd to set the
> + * young bit, instead of the current set_pmd_at.
> + */
> + _pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
> + set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
> + }
> + page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> + VM_BUG_ON(!PageCompound(page));
> + if (flags & FOLL_GET)
> + get_page(page);
> +
> +out:
> + return page;
> +}
> +
> +int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> + pmd_t *pmd)
> +{
> + int ret = 0;
> +
> + spin_lock(&tlb->mm->page_table_lock);
> + if (likely(pmd_trans_huge(*pmd))) {
> + if (unlikely(pmd_trans_splitting(*pmd))) {
> + spin_unlock(&tlb->mm->page_table_lock);
> + wait_split_huge_page(vma->anon_vma,
> + pmd);
> + } else {
> + struct page *page;
> + pgtable_t pgtable;
> + pgtable = get_pmd_huge_pte(tlb->mm);
> + page = pmd_page(*pmd);
> + pmd_clear(pmd);
> + page_remove_rmap(page);
> + VM_BUG_ON(page_mapcount(page) < 0);
> + add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> + VM_BUG_ON(!PageHead(page));
> + spin_unlock(&tlb->mm->page_table_lock);
> + tlb_remove_page(tlb, page);
> + pte_free(tlb->mm, pgtable);
> + ret = 1;
> + }
> + } else
> + spin_unlock(&tlb->mm->page_table_lock);
> +
> + return ret;
> +}
> +
> +pmd_t *page_check_address_pmd(struct page *page,
> + struct mm_struct *mm,
> + unsigned long address,
> + enum page_check_address_pmd_flag flag)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd, *ret = NULL;
> +
> + if (address & ~HPAGE_PMD_MASK)
> + goto out;
> +
> + pgd = pgd_offset(mm, address);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, address);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, address);
> + if (pmd_none(*pmd))
> + goto out;
> + if (pmd_page(*pmd) != page)
> + goto out;
> + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
> + pmd_trans_splitting(*pmd));
> + if (pmd_trans_huge(*pmd)) {
> + VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
> + !pmd_trans_splitting(*pmd));
> + ret = pmd;
> + }
> +out:
> + return ret;
> +}
> +
> +static int __split_huge_page_splitting(struct page *page,
> + struct vm_area_struct *vma,
> + unsigned long address)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pmd_t *pmd;
> + int ret = 0;
> +
> + spin_lock(&mm->page_table_lock);
> + pmd = page_check_address_pmd(page, mm, address,
> + PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
> + if (pmd) {
> + /*
> + * We can't temporarily set the pmd to null in order
> + * to split it, the pmd must remain marked huge at all
> + * times or the VM won't take the pmd_trans_huge paths
> + * and it won't wait on the anon_vma->root->lock to
> + * serialize against split_huge_page*.
> + */
> + pmdp_splitting_flush_notify(vma, address, pmd);
> + ret = 1;
> + }
> + spin_unlock(&mm->page_table_lock);
> +
> + return ret;
> +}
> +
> +static void __split_huge_page_refcount(struct page *page)
> +{
> + int i;
> + unsigned long head_index = page->index;
> + struct zone *zone = page_zone(page);
> +
> + /* prevent PageLRU to go away from under us, and freeze lru stats */
> + spin_lock_irq(&zone->lru_lock);
> + compound_lock(page);
> +
> + for (i = 1; i < HPAGE_PMD_NR; i++) {
> + struct page *page_tail = page + i;
> +
> + /* tail_page->_count cannot change */
> + atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> + BUG_ON(page_count(page) <= 0);
> + atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> + BUG_ON(atomic_read(&page_tail->_count) <= 0);
> +
> + /* after clearing PageTail the gup refcount can be released */
> + smp_mb();
> +
> + page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> + page_tail->flags |= (page->flags &
> + ((1L << PG_referenced) |
> + (1L << PG_swapbacked) |
> + (1L << PG_mlocked) |
> + (1L << PG_uptodate)));
> + page_tail->flags |= (1L << PG_dirty);
> +
> + /*
> + * 1) clear PageTail before overwriting first_page
> + * 2) clear PageTail before clearing PageHead for VM_BUG_ON
> + */
> + smp_wmb();
> +
> + /*
> + * __split_huge_page_splitting() already set the
> + * splitting bit in all pmd that could map this
> + * hugepage, that will ensure no CPU can alter the
> + * mapcount on the head page. The mapcount is only
> + * accounted in the head page and it has to be
> + * transferred to all tail pages in the below code. So
> + * for this code to be safe, the split the mapcount
> + * can't change. But that doesn't mean userland can't
> + * keep changing and reading the page contents while
> + * we transfer the mapcount, so the pmd splitting
> + * status is achieved setting a reserved bit in the
> + * pmd, not by clearing the present bit.
> + */
> + BUG_ON(page_mapcount(page_tail));
> + page_tail->_mapcount = page->_mapcount;
> +
> + BUG_ON(page_tail->mapping);
> + page_tail->mapping = page->mapping;
> +
> + page_tail->index = ++head_index;
> +
> + BUG_ON(!PageAnon(page_tail));
> + BUG_ON(!PageUptodate(page_tail));
> + BUG_ON(!PageDirty(page_tail));
> + BUG_ON(!PageSwapBacked(page_tail));
> +
> + lru_add_page_tail(zone, page, page_tail);
> + }
> +
> + ClearPageCompound(page);
> + compound_unlock(page);
> + spin_unlock_irq(&zone->lru_lock);
> +
> + for (i = 1; i < HPAGE_PMD_NR; i++) {
> + struct page *page_tail = page + i;
> + BUG_ON(page_count(page_tail) <= 0);
> + /*
> + * Tail pages may be freed if there wasn't any mapping
> + * like if add_to_swap() is running on a lru page that
> + * had its mapping zapped. And freeing these pages
> + * requires taking the lru_lock so we do the put_page
> + * of the tail pages after the split is complete.
> + */
> + put_page(page_tail);
> + }
> +
> + /*
> + * Only the head page (now become a regular page) is required
> + * to be pinned by the caller.
> + */
> + BUG_ON(page_count(page) <= 0);
> +}
> +
> +static int __split_huge_page_map(struct page *page,
> + struct vm_area_struct *vma,
> + unsigned long address)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pmd_t *pmd, _pmd;
> + int ret = 0, i;
> + pgtable_t pgtable;
> + unsigned long haddr;
> +
> + spin_lock(&mm->page_table_lock);
> + pmd = page_check_address_pmd(page, mm, address,
> + PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> + if (pmd) {
> + pgtable = get_pmd_huge_pte(mm);
> + pmd_populate(mm, &_pmd, pgtable);
> +
> + for (i = 0, haddr = address; i < HPAGE_PMD_NR;
> + i++, haddr += PAGE_SIZE) {
> + pte_t *pte, entry;
> + BUG_ON(PageCompound(page+i));
> + entry = mk_pte(page + i, vma->vm_page_prot);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + if (!pmd_write(*pmd))
> + entry = pte_wrprotect(entry);
> + else
> + BUG_ON(page_mapcount(page) != 1);
> + if (!pmd_young(*pmd))
> + entry = pte_mkold(entry);
> + pte = pte_offset_map(&_pmd, haddr);
> + BUG_ON(!pte_none(*pte));
> + set_pte_at(mm, haddr, pte, entry);
> + pte_unmap(pte);
> + }
> +
> + mm->nr_ptes++;
> + smp_wmb(); /* make pte visible before pmd */
> + /*
> + * Up to this point the pmd is present and huge and
> + * userland has the whole access to the hugepage
> + * during the split (which happens in place). If we
> + * overwrite the pmd with the not-huge version
> + * pointing to the pte here (which of course we could
> + * if all CPUs were bug free), userland could trigger
> + * a small page size TLB miss on the small sized TLB
> + * while the hugepage TLB entry is still established
> + * in the huge TLB. Some CPU doesn't like that. See
> + * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
> + * Erratum 383 on page 93. Intel should be safe but is
> + * also warns that it's only safe if the permission
> + * and cache attributes of the two entries loaded in
> + * the two TLB is identical (which should be the case
> + * here). But it is generally safer to never allow
> + * small and huge TLB entries for the same virtual
> + * address to be loaded simultaneously. So instead of
> + * doing "pmd_populate(); flush_tlb_range();" we first
> + * mark the current pmd notpresent (atomically because
> + * here the pmd_trans_huge and pmd_trans_splitting
> + * must remain set at all times on the pmd until the
> + * split is complete for this pmd), then we flush the
> + * SMP TLB and finally we write the non-huge version
> + * of the pmd entry with pmd_populate.
> + */
> + set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
> + flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> + pmd_populate(mm, pmd, pgtable);
> + ret = 1;
> + }
> + spin_unlock(&mm->page_table_lock);
> +
> + return ret;
> +}
> +
> +/* must be called with anon_vma->root->lock hold */
> +static void __split_huge_page(struct page *page,
> + struct anon_vma *anon_vma)
> +{
> + int mapcount, mapcount2;
> + struct anon_vma_chain *avc;
> +
> + BUG_ON(!PageHead(page));
> + BUG_ON(PageTail(page));
> +
> + mapcount = 0;
> + list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> + struct vm_area_struct *vma = avc->vma;
> + unsigned long addr = vma_address(page, vma);
> + if (addr == -EFAULT)
> + continue;
> + mapcount += __split_huge_page_splitting(page, vma, addr);
> + }
> + BUG_ON(mapcount != page_mapcount(page));
> +
> + __split_huge_page_refcount(page);
> +
> + mapcount2 = 0;
> + list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> + struct vm_area_struct *vma = avc->vma;
> + unsigned long addr = vma_address(page, vma);
> + if (addr == -EFAULT)
> + continue;
> + mapcount2 += __split_huge_page_map(page, vma, addr);
> + }
> + BUG_ON(mapcount != mapcount2);
> +}
> +
> +int split_huge_page(struct page *page)
> +{
> + struct anon_vma *anon_vma;
> + int ret = 1;
> +
> + BUG_ON(!PageAnon(page));
> + anon_vma = page_lock_anon_vma(page);
> + if (!anon_vma)
> + goto out;
> + ret = 0;
> + if (!PageCompound(page))
> + goto out_unlock;
> +
> + BUG_ON(!PageSwapBacked(page));
> + __split_huge_page(page, anon_vma);
> +
> + BUG_ON(PageCompound(page));
> +out_unlock:
> + page_unlock_anon_vma(anon_vma);
> +out:
> + return ret;
> +}
> +
> +void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
> +{
> + struct page *page;
> +
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_trans_huge(*pmd))) {
> + spin_unlock(&mm->page_table_lock);
> + return;
> + }
> + page = pmd_page(*pmd);
> + VM_BUG_ON(!page_count(page));
> + get_page(page);
> + spin_unlock(&mm->page_table_lock);
> +
> + split_huge_page(page);
> +
> + put_page(page);
> + BUG_ON(pmd_trans_huge(*pmd));
> +}
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -726,9 +726,9 @@ out_set_pte:
> return 0;
> }
>
> -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> - pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> - unsigned long addr, unsigned long end)
> +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> + pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> + unsigned long addr, unsigned long end)
> {
> pte_t *orig_src_pte, *orig_dst_pte;
> pte_t *src_pte, *dst_pte;
> @@ -802,6 +802,16 @@ static inline int copy_pmd_range(struct
> src_pmd = pmd_offset(src_pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> + if (pmd_trans_huge(*src_pmd)) {
> + int err;
> + err = copy_huge_pmd(dst_mm, src_mm,
> + dst_pmd, src_pmd, addr, vma);
> + if (err == -ENOMEM)
> + return -ENOMEM;
> + if (!err)
> + continue;
> + /* fall through */
> + }
> if (pmd_none_or_clear_bad(src_pmd))
> continue;
> if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
> @@ -1004,6 +1014,15 @@ static inline unsigned long zap_pmd_rang
> pmd = pmd_offset(pud, addr);
> do {
> next = pmd_addr_end(addr, end);
> + if (pmd_trans_huge(*pmd)) {
> + if (next-addr != HPAGE_PMD_SIZE)
> + split_huge_page_pmd(vma->vm_mm, pmd);
> + else if (zap_huge_pmd(tlb, vma, pmd)) {
> + (*zap_work)--;
> + continue;
> + }
> + /* fall through */
> + }
> if (pmd_none_or_clear_bad(pmd)) {
> (*zap_work)--;
> continue;
> @@ -1280,11 +1299,27 @@ struct page *follow_page(struct vm_area_
> pmd = pmd_offset(pud, address);
> if (pmd_none(*pmd))
> goto no_page_table;
> - if (pmd_huge(*pmd)) {
> + if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> BUG_ON(flags & FOLL_GET);
> page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> goto out;
> }
> + if (pmd_trans_huge(*pmd)) {
> + spin_lock(&mm->page_table_lock);
> + if (likely(pmd_trans_huge(*pmd))) {
> + if (unlikely(pmd_trans_splitting(*pmd))) {
> + spin_unlock(&mm->page_table_lock);
> + wait_split_huge_page(vma->anon_vma, pmd);
> + } else {
> + page = follow_trans_huge_pmd(mm, address,
> + pmd, flags);
> + spin_unlock(&mm->page_table_lock);
> + goto out;
> + }
> + } else
> + spin_unlock(&mm->page_table_lock);
> + /* fall through */
> + }
> if (unlikely(pmd_bad(*pmd)))
> goto no_page_table;
>
> @@ -3141,9 +3176,9 @@ static int do_nonlinear_fault(struct mm_
> * but allow concurrent faults), and pte mapped but not yet locked.
> * We return with mmap_sem still held, but pte unmapped and unlocked.
> */
> -static inline int handle_pte_fault(struct mm_struct *mm,
> - struct vm_area_struct *vma, unsigned long address,
> - pte_t *pte, pmd_t *pmd, unsigned int flags)
> +int handle_pte_fault(struct mm_struct *mm,
> + struct vm_area_struct *vma, unsigned long address,
> + pte_t *pte, pmd_t *pmd, unsigned int flags)
> {
> pte_t entry;
> spinlock_t *ptl;
> @@ -3222,9 +3257,40 @@ int handle_mm_fault(struct mm_struct *mm
> pmd = pmd_alloc(mm, pud, address);
> if (!pmd)
> return VM_FAULT_OOM;
> - pte = pte_alloc_map(mm, vma, pmd, address);
> - if (!pte)
> + if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> + if (!vma->vm_ops)
> + return do_huge_pmd_anonymous_page(mm, vma, address,
> + pmd, flags);
> + } else {
> + pmd_t orig_pmd = *pmd;
> + barrier();

What is this barrier for?

> + if (pmd_trans_huge(orig_pmd)) {
> + if (flags & FAULT_FLAG_WRITE &&
> + !pmd_write(orig_pmd) &&
> + !pmd_trans_splitting(orig_pmd))
> + return do_huge_pmd_wp_page(mm, vma, address,
> + pmd, orig_pmd);
> + return 0;
> + }
> + }
> +
> + /*
> + * Use __pte_alloc instead of pte_alloc_map, because we can't
> + * run pte_offset_map on the pmd, if an huge pmd could
> + * materialize from under us from a different thread.
> + */
> + if (unlikely(__pte_alloc(mm, vma, pmd, address)))
> return VM_FAULT_OOM;
> + /* if an huge pmd materialized from under us just retry later */
> + if (unlikely(pmd_trans_huge(*pmd)))
> + return 0;
> + /*
> + * A regular pmd is established and it can't morph into a huge pmd
> + * from under us anymore at this point because we hold the mmap_sem
> + * read mode and khugepaged takes it in write mode. So now it's
> + * safe to run pte_offset_map().
> + */
> + pte = pte_offset_map(pmd, address);
>
> return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> }
> diff --git a/mm/rmap.c b/mm/rmap.c
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -360,7 +360,7 @@ void page_unlock_anon_vma(struct anon_vm
> * Returns virtual address or -EFAULT if page's index/offset is not
> * within the range mapped the @vma.
> */
> -static inline unsigned long
> +inline unsigned long
> vma_address(struct page *page, struct vm_area_struct *vma)
> {
> pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> @@ -435,6 +435,8 @@ pte_t *__page_check_address(struct page
> pmd = pmd_offset(pud, address);
> if (!pmd_present(*pmd))
> return NULL;
> + if (pmd_trans_huge(*pmd))
> + return NULL;
>
> pte = pte_offset_map(pmd, address);
> /* Make a quick check before getting the lock */
> @@ -489,35 +491,17 @@ int page_referenced_one(struct page *pag
> unsigned long *vm_flags)
> {
> struct mm_struct *mm = vma->vm_mm;
> - pte_t *pte;
> - spinlock_t *ptl;
> int referenced = 0;
>
> - pte = page_check_address(page, mm, address, &ptl, 0);
> - if (!pte)
> - goto out;
> -
> /*
> * Don't want to elevate referenced for mlocked page that gets this far,
> * in order that it progresses to try_to_unmap and is moved to the
> * unevictable list.
> */
> if (vma->vm_flags & VM_LOCKED) {
> - *mapcount = 1; /* break early from loop */
> + *mapcount = 0; /* break early from loop */
> *vm_flags |= VM_LOCKED;
> - goto out_unmap;
> - }
> -
> - if (ptep_clear_flush_young_notify(vma, address, pte)) {
> - /*
> - * Don't treat a reference through a sequentially read
> - * mapping as such. If the page has been used in
> - * another mapping, we will catch it; if this other
> - * mapping is already gone, the unmap path will have
> - * set PG_referenced or activated the page.
> - */
> - if (likely(!VM_SequentialReadHint(vma)))
> - referenced++;
> + goto out;
> }
>
> /* Pretend the page is referenced if the task has the
> @@ -526,9 +510,39 @@ int page_referenced_one(struct page *pag
> rwsem_is_locked(&mm->mmap_sem))
> referenced++;
>
> -out_unmap:
> + if (unlikely(PageTransHuge(page))) {
> + pmd_t *pmd;
> +
> + spin_lock(&mm->page_table_lock);
> + pmd = page_check_address_pmd(page, mm, address,
> + PAGE_CHECK_ADDRESS_PMD_FLAG);
> + if (pmd && !pmd_trans_splitting(*pmd) &&
> + pmdp_clear_flush_young_notify(vma, address, pmd))
> + referenced++;
> + spin_unlock(&mm->page_table_lock);
> + } else {
> + pte_t *pte;
> + spinlock_t *ptl;
> +
> + pte = page_check_address(page, mm, address, &ptl, 0);
> + if (!pte)
> + goto out;
> +
> + if (ptep_clear_flush_young_notify(vma, address, pte)) {
> + /*
> + * Don't treat a reference through a sequentially read
> + * mapping as such. If the page has been used in
> + * another mapping, we will catch it; if this other
> + * mapping is already gone, the unmap path will have
> + * set PG_referenced or activated the page.
> + */
> + if (likely(!VM_SequentialReadHint(vma)))
> + referenced++;
> + }
> + pte_unmap_unlock(pte, ptl);
> + }
> +
> (*mapcount)--;
> - pte_unmap_unlock(pte, ptl);
>
> if (referenced)
> *vm_flags |= vma->vm_flags;
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -465,6 +465,43 @@ void __pagevec_release(struct pagevec *p
>
> EXPORT_SYMBOL(__pagevec_release);
>
> +/* used by __split_huge_page_refcount() */
> +void lru_add_page_tail(struct zone* zone,
> + struct page *page, struct page *page_tail)
> +{
> + int active;
> + enum lru_list lru;
> + const int file = 0;
> + struct list_head *head;
> +
> + VM_BUG_ON(!PageHead(page));
> + VM_BUG_ON(PageCompound(page_tail));
> + VM_BUG_ON(PageLRU(page_tail));
> + VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
> +
> + SetPageLRU(page_tail);
> +
> + if (page_evictable(page_tail, NULL)) {
> + if (PageActive(page)) {
> + SetPageActive(page_tail);
> + active = 1;
> + lru = LRU_ACTIVE_ANON;
> + } else {
> + active = 0;
> + lru = LRU_INACTIVE_ANON;
> + }
> + update_page_reclaim_stat(zone, page_tail, file, active);
> + if (likely(PageLRU(page)))
> + head = page->lru.prev;
> + else
> + head = &zone->lru[lru].list;
> + __add_page_to_lru_list(zone, page_tail, lru, head);
> + } else {
> + SetPageUnevictable(page_tail);
> + add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
> + }
> +}
> +
> /*
> * Add the passed pages to the LRU, then drop the caller's refcount
> * on them. Reinitialises the caller's pagevec.
>

Other than a few minor questions, this seems very similar to what you
had before. There is a lot going on in this patch but I did not find
anything wrong.

Acked-by: Mel Gorman <mel@xxxxxxxxx>

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab