Re: [PATCH 1/7] hugetlb, rmap: add reverse mapping for hugepage

From: Mel Gorman
Date: Fri May 14 2010 - 05:55:20 EST


On Fri, May 14, 2010 at 04:46:41PM +0900, Naoya Horiguchi wrote:
> On Thu, May 13, 2010 at 04:27:37PM +0100, Mel Gorman wrote:
> > On Thu, May 13, 2010 at 04:55:20PM +0900, Naoya Horiguchi wrote:
> > > While hugepage is not currently swappable, rmapping can be useful
> > > for memory error handler.
> > > Using rmap, memory error handler can collect processes affected
> > > by hugepage errors and unmap them to contain error's effect.
> > >
> >
> > As a verification point, can you ensure that the libhugetlbfs "make
> > func" tests complete successfully with this patch applied? It's also
> > important that there is no oddness in the Hugepage-related counters in
> > /proc/meminfo. I'm not in the position to test it now unfortunately as
> > I'm on the road.
>
> Yes. Thanks for the good test-set.
>
> Hmm. I failed libhugetlbfs test with a oops in "private mapped" test :(
>

That's not a disaster - it's what the regression test is for. I haven't
restarted the review in this case. I'll wait for another version that
passes those regression tests.

> <OOPS SNIP>
>
> Someone seems to call hugetlb_fault() with anon_vma == NULL.
> For more detail, I'm investigating it.
>

Sure.

> > > Current status of hugepage rmap differs depending on mapping mode:
> > > - for shared hugepage:
> > > we can collect processes using a hugepage through pagecache,
> > > but can not unmap the hugepage because of the lack of mapcount.
> > > - for privately mapped hugepage:
> > > we can neither collect processes nor unmap the hugepage.
> > >
> > > To realize hugepage rmapping, this patch introduces mapcount for
> > > shared/private-mapped hugepage and anon_vma for private-mapped hugepage.
> > >
> > > This patch can be the replacement of the following bug fix.
> > >
> >
> > Actually, you replace chunks but not all of that fix with this patch.
> > After this patch HUGETLB_POISON is never assigned but the definition still
> > exists in poison.h. You should also remove it if it is unnecessary.
>
> OK. I'll remove HUGETLB_POISON in the next post.
>

Thanks

> <SNIP>
> > For ordinary anon_vma's, there
> > is a chain of related vma's chained together via the anon_vma's. It's so
> > in the event of an unmapping, all the PTEs related to the page can be
> > found. Where are we doing the same here?
>
> Finding all processes using a hugepage is done by try_to_unmap() as usual.
> Among callers of this function, only memory error handler calls it for
> hugepage for now.

Ok, my bad, it's anon_vma_prepare that does most of the linkages.
However, there still appears to be logic missing between how anon rmap
pages are setup and hugetlb anon rmap pages. See __page_set_anon_rmap
for example and what it does with chains and compare it to
hugetlb_add_anon_rmap. There are some important differences.


> What this patch does is to enable try_to_unmap() to be called for hugepages
> by setting up anon_vma in hugetlb code.
>
> > I think what you're getting with this is the ability to unmap MAP_PRIVATE pages
> > from one process but if there are multiple processes, the second process could
> > still end up referencing the poisoned MAP_PRIVATE page. Is this accurate? Even
> > if it is, I guess it's still an improvement over what currently happens.
>
> Try_to_unmap_anon() runs for each vma belonging to the anon_vma associated
> with the error hugepage. So it works for multiple processes.
>

Yep, as long as anon_vma_prepare is called in all the correct cases. I
haven't double checked you have and will wait until you pin down why
anon_vma is NULL in the next version.

> > > +
> > > static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
> > > unsigned long address, pte_t *ptep, pte_t pte,
> > > struct page *pagecache_page)
> > > @@ -2348,6 +2371,12 @@ retry_avoidcopy:
> > > huge_ptep_clear_flush(vma, address, ptep);
> > > set_huge_pte_at(mm, address, ptep,
> > > make_huge_pte(vma, new_page, 1));
> > > + page_remove_rmap(old_page);
> > > + /*
> > > + * We need not call anon_vma_prepare() because anon_vma
> > > + * is already prepared when the process fork()ed.
> > > + */
> > > + hugepage_add_anon_rmap(new_page, vma, address);
> >
> > This means that the anon_vma is shared between parent and child even
> > after fork. Does this not mean that the behaviour of anon_vma differs
> > between the core VM and hugetlb?
>
> No. IIUC, anon_vma associated with (non-huge) anonymous page is also shared
> between parent and child until COW.
>

In the base page case, it does but where is a new anon_vma being
allocated here and the rmap moved with page_move_anon_rmap?

> > > /* Make the old page be freed below */
> > > new_page = old_page;
> > > }
> > > @@ -2450,7 +2479,11 @@ retry:
> > > spin_unlock(&inode->i_lock);
> > > } else {
> > > lock_page(page);
> > > - page->mapping = HUGETLB_POISON;
> > > + if (unlikely(anon_vma_prepare(vma))) {
> > > + ret = VM_FAULT_OOM;
> > > + goto backout_unlocked;
> > > + }
> > > + hugepage_add_anon_rmap(page, vma, address);
> >
> > Seems ok for private pages at least.
> >
> > > }
> > > }
> > >
> > > @@ -2479,6 +2512,13 @@ retry:
> > > && (vma->vm_flags & VM_SHARED)));
> > > set_huge_pte_at(mm, address, ptep, new_pte);
> > >
> > > + /*
> > > + * For privately mapped hugepage, _mapcount is incremented
> > > + * in hugetlb_cow(), so only increment for shared hugepage here.
> > > + */
> > > + if (vma->vm_flags & VM_MAYSHARE)
> > > + page_dup_rmap(page);
> > > +
> >
> > What happens when try_to_unmap_file is called on a hugetlb page?
>
> Try_to_unmap_file() is called for shared hugepages, so it tracks all vmas
> sharing one hugepage through pagecache pointed to by page->mapping,
> and sets all ptes into hwpoison swap entries instead of flushing them.
> Curiously file backed pte is changed to swap entry, but it's OK because
> hwpoison hugepage should not be touched afterward.
>

Grand.

> > > if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> > > /* Optimization, do the COW without a second fault */
> > > ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
> > > diff --git v2.6.34-rc7/mm/rmap.c v2.6.34-rc7/mm/rmap.c
> > > index 0feeef8..58cd2f9 100644
> > > --- v2.6.34-rc7/mm/rmap.c
> > > +++ v2.6.34-rc7/mm/rmap.c
> > > @@ -56,6 +56,7 @@
> > > #include <linux/memcontrol.h>
> > > #include <linux/mmu_notifier.h>
> > > #include <linux/migrate.h>
> > > +#include <linux/hugetlb.h>
> > >
> > > #include <asm/tlbflush.h>
> > >
> > > @@ -326,6 +327,8 @@ vma_address(struct page *page, struct vm_area_struct *vma)
> > > pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> > > unsigned long address;
> > >
> > > + if (unlikely(is_vm_hugetlb_page(vma)))
> > > + pgoff = page->index << compound_order(page);
> >
> > Again, it would be nice to use hstate information if possible just so
> > how the pagesize is discovered is consistent.
>
> OK.
>
> > > address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > > if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > > /* page should be within @vma mapping range */
> > > @@ -369,6 +372,12 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> > > pte_t *pte;
> > > spinlock_t *ptl;
> > >
> > > + if (unlikely(PageHuge(page))) {
> > > + pte = huge_pte_offset(mm, address);
> > > + ptl = &mm->page_table_lock;
> > > + goto check;
> > > + }
> > > +
> > > pgd = pgd_offset(mm, address);
> > > if (!pgd_present(*pgd))
> > > return NULL;
> > > @@ -389,6 +398,7 @@ pte_t *page_check_address(struct page *page, struct mm_struct *mm,
> > > }
> > >
> > > ptl = pte_lockptr(mm, pmd);
> > > +check:
> > > spin_lock(ptl);
> > > if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
> > > *ptlp = ptl;
> > > @@ -873,6 +883,12 @@ void page_remove_rmap(struct page *page)
> > > page_clear_dirty(page);
> > > set_page_dirty(page);
> > > }
> > > + /*
> > > + * Mapping for Hugepages are not counted in NR_ANON_PAGES nor
> > > + * NR_FILE_MAPPED and no charged by memcg for now.
> > > + */
> > > + if (unlikely(PageHuge(page)))
> > > + return;
> > > if (PageAnon(page)) {
> > > mem_cgroup_uncharge_page(page);
> > > __dec_zone_page_state(page, NR_ANON_PAGES);
> >
> > I don't see anything obviously wrong with this but it's a bit rushed and
> > there are a few snarls that I pointed out above. I'd like to hear it passed
> > the libhugetlbfs regression tests for different sizes without any oddness
> > in the counters.
>
> Since there exists regression as described above, I'll fix it first of all.
>

Sure. As it is, the hugetlb parts of this patch to my eye are not ready yet
with some snags that need ironing out. That said, I see nothing fundamentally
wrong with the approach as such.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/