Re: [PATCH v3 2/5] mm/hugetlb: take page table lock in follow_huge_pmd()

From: Naoya Horiguchi
Date: Tue Oct 07 2014 - 12:57:06 EST


Hi Hugh,

Sorry for the delayed response, I was off for vacation.

On Mon, Sep 29, 2014 at 09:32:20PM -0700, Hugh Dickins wrote:
> On Mon, 15 Sep 2014, Naoya Horiguchi wrote:
> > We have a race condition between move_pages() and freeing hugepages,
>
> I've been looking through these 5 today, and they're much better now,
> thank you. But a new concern below, and a minor correction to 3/5.
>
> > --- mmotm-2014-09-09-14-42.orig/mm/gup.c
> > +++ mmotm-2014-09-09-14-42/mm/gup.c
> > @@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> >
> > pmd = pmd_offset(pud, address);
> > if (pmd_none(*pmd))
> > return no_page_table(vma, flags);
> > - if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> > - page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> > - if (flags & FOLL_GET) {
> > - /*
> > - * Refcount on tail pages are not well-defined and
> > - * shouldn't be taken. The caller should handle a NULL
> > - * return when trying to follow tail pages.
> > - */
> > - if (PageHead(page))
> > - get_page(page);
> > - else
> > - page = NULL;
> > - }
> > - return page;
> > - }
> > + if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
> > + return follow_huge_pmd(mm, address, pmd, flags);
>
> This code here allows for pmd_none() and pmd_huge(), and for pmd_numa()
> and pmd_trans_huge() below; but makes no explicit allowance for !present
> migration and hwpoison entries.
>
> Is it assumed that the pmd_bad() test in follow_page_pte() will catch
> those?

Yes, it is now.

> But what of races?

The current patch is still racy when hugepage migrations from different
reasons (hotremove and mbind, for example) happen concurrently.
We need a fix.

> migration entries are highly volatile. And
> is it assumed that a migration entry cannot pass the pmd_huge() test?

Yes, _PAGE_PSE bit is always clear for migration/hwpoison entry, so they
can never pass pmd_huge() test for now.

> That may be true of x86 today, I'm not certain; but if the soft-dirty
> people catch up with the hugetlb-migration people, that might change
> (they #define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE).

Yes, this problem is not visible now (note that currently _PAGE_SWP_SOFT_DIRTY
is never set on pmd because hugepage is never swapped out,)
but it's potential one.

>
> Why pmd_huge() does not itself test for present, I cannot say; but it
> probably didn't matter at all before hwpoison and migration were added.

Correct, so we need check _PAGE_PRESENT bit in x86_64 pmd_huge() now.
And we need do some proper actions if we find migration/hwpoison here.
To do this, adding another routine like huge_pmd_present() might be useful
(pmd_present() is already used for thp.)

>
> Mind you, with __get_user_pages()'s is_vm_hugtlb_page() test avoiding
> all this code, maybe the only thing that can stumble here is your own
> hugetlb migration code; but that appears to be guarded only by
> down_read of mmap_sem, so races would be possible (if userspace
> is silly enough or malicious enough to do so).

I guess that the race is fixed by this patch with checking _PAGE_PRESENT
in pmd_huge(). Or are you mentioning another race?

>
> What we have here today looks too fragile to me, but it's probably
> best dealt with by a separate patch.
>
> Or I may be over-anxious, and there may be something "obvious"
> that I'm missing, which saves us from further change.

No, you found a new issue in the current code, thank you very much.

> > if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
> > return no_page_table(vma, flags);
> > if (pmd_trans_huge(*pmd)) {
> > diff --git mmotm-2014-09-09-14-42.orig/mm/hugetlb.c mmotm-2014-09-09-14-42/mm/hugetlb.c
> > index 34351251e164..941832ee3d5a 100644
> > --- mmotm-2014-09-09-14-42.orig/mm/hugetlb.c
> > +++ mmotm-2014-09-09-14-42/mm/hugetlb.c
> > @@ -3668,26 +3668,34 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
> >
> > struct page * __weak
> > follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > - pmd_t *pmd, int write)
> > + pmd_t *pmd, int flags)
> > {
> > - struct page *page;
> > + struct page *page = NULL;
> > + spinlock_t *ptl;
> >
> > - page = pte_page(*(pte_t *)pmd);
> > - if (page)
> > - page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
> > + ptl = pmd_lockptr(mm, pmd);
> > + spin_lock(ptl);
> > +
> > + if (!pmd_huge(*pmd))
> > + goto out;
>
> And similarly here. Though at least here we now have the necessary
> lock, so it's no longer racy, and maybe this pmd_huge() test just needs
> to be replaced by a pmd_present() test? Or are both needed?

This check is introduced because the first pmd_huge() check outside
follow_huge_pmd() is called without page table lock. So keeping it to
recheck after holding lock looks correct to me.
But as I mentioned above, I'm thinking of changing x86_64's pmd_huge to
check both _PAGE_PRESENT and _PAGE_PSE to make sure that *pmd is pointing
to a normal hugepage, so this check is internally changed to check both.

Thanks,
Naoya Horiguchi

> > +
> > + page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> > +
> > + if (flags & FOLL_GET)
> > + get_page(page);
> > +out:
> > + spin_unlock(ptl);
> > return page;
> > }
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/