Re: [PATCH 05/19] mm, mpol: Create special PROT_NONE infrastructure

From: Andrea Arcangeli
Date: Thu Aug 09 2012 - 17:43:35 EST


On Tue, Jul 31, 2012 at 09:12:09PM +0200, Peter Zijlstra wrote:
> +static bool pte_prot_none(struct vm_area_struct *vma, pte_t pte)
> +{
> + /*
> + * If we have the normal vma->vm_page_prot protections we're not a
> + * 'special' PROT_NONE page.
> + *
> + * This means we cannot get 'special' PROT_NONE faults from genuine
> + * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
> + * tracking.
> + *
> + * Neither case is really interesting for our current use though so we
> + * don't care.
> + */
> + if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
> + return false;
> +
> + return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
> +}
> @@ -3453,6 +3518,9 @@ int handle_pte_fault(struct mm_struct *m
> pte, pmd, flags, entry);
> }
>
> + if (pte_prot_none(vma, entry))
> + return do_prot_none(mm, vma, address, pte, pmd, flags, entry);
> +
> ptl = pte_lockptr(mm, pmd);
> spin_lock(ptl);
> if (unlikely(!pte_same(*pte, entry)))

I recommend calling it pte_numa(), not pte_prot_none(), given that it
returns true only when the pte is *not* a real PROT_NONE mapping.

Also I'd leave the details fully hidden in arch code; there's no need
to expose them to common code and force all archs to use the
PROT_NONE bitflag to implement NUMA hinting page faults. If an arch
wants to avoid touching the vma cacheline (to avoid wasting time
comparing the pteval against vma->vm_page_prot), it can use a bitflag
different from protnone, like AutoNUMA is currently doing. There's no
reason to force the use of PROT_NONE at the common-code level.

My current implementation of the numa hinting page fault follows:

	spin_lock(ptl);
	if (unlikely(!pte_same(*pte, entry)))
		goto unlock;
	entry = pte_numa_fixup(mm, vma, address, entry, pte);

static inline pte_t pte_numa_fixup(struct mm_struct *mm,
				   struct vm_area_struct *vma,
				   unsigned long addr, pte_t pte, pte_t *ptep)
{
	if (pte_numa(pte))
		pte = __pte_numa_fixup(mm, vma, addr, pte, ptep);
	return pte;
}

I can easily change my entry point a bit to call it before taking the
spinlocks, like you're doing, to accommodate your sync migration needs.

But I think it's better implemented as above; just passing the vma
along until it reaches pte_numa(pte, vma) should be enough.