Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case ofvma_merge succeeding in copy_vma

From: Nai Xia
Date: Tue Nov 01 2011 - 10:35:23 EST


On Tue, Nov 1, 2011 at 1:27 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> migrate was doing a rmap_walk with speculative lock-less access on
> pagetables. That could lead it to not serialize properly against
> mremap PT locks. But a second problem remains in the order of vmas in
> the same_anon_vma list used by the rmap_walk.
>
> If vma_merge would succeed in copy_vma, the src vma could be placed
> after the dst vma in the same_anon_vma list. That could still lead
> migrate to miss some pte.
>
> This patch adds a anon_vma_order_tail() function to force the dst vma
> at the end of the list before mremap starts to solve the problem.
>
> If the mremap is very large and there are a lots of parents or childs
> sharing the anon_vma root lock, this should still scale better than
> taking the anon_vma root lock around every pte copy practically for
> the whole duration of mremap.
>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---
>  include/linux/rmap.h |    1 +
>  mm/mmap.c            |    8 ++++++++
>  mm/rmap.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 53 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 2148b12..45eb098 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -120,6 +120,7 @@ void anon_vma_init(void);   /* create anon_vma_cachep */
>  int  anon_vma_prepare(struct vm_area_struct *);
>  void unlink_anon_vmas(struct vm_area_struct *);
>  int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
> +void anon_vma_order_tail(struct vm_area_struct *);
>  int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
>  void __anon_vma_link(struct vm_area_struct *);
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a65efd4..a5858dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2339,7 +2339,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>                 */
>                if (vma_start >= new_vma->vm_start &&
>                    vma_start < new_vma->vm_end)
> +                       /*
> +                        * No need to call anon_vma_order_tail() in
> +                        * this case because the same PT lock will
> +                        * serialize the rmap_walk against both src
> +                        * and dst vmas.
> +                        */
>                        *vmap = new_vma;
> +               else
> +                       anon_vma_order_tail(new_vma);
>        } else {
>                new_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
>                if (new_vma) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8005080..6dbc165 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -272,6 +272,50 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  }
>
>  /*
> + * Some rmap walk that needs to find all ptes/hugepmds without false
> + * negatives (like migrate and split_huge_page) running concurrent
> + * with operations that copy or move pagetables (like mremap() and
> + * fork()) to be safe depends the anon_vma "same_anon_vma" list to be
> + * in a certain order: the dst_vma must be placed after the src_vma in
> + * the list. This is always guaranteed by fork() but mremap() needs to
> + * call this function to enforce it in case the dst_vma isn't newly
> + * allocated and chained with the anon_vma_clone() function but just
> + * an extension of a pre-existing vma through vma_merge.
> + *
> + * NOTE: the same_anon_vma list can still be changed by other
> + * processes while mremap runs because mremap doesn't hold the
> + * anon_vma mutex to prevent modifications to the list while it
> + * runs. All we need to enforce is that the relative order of this
> + * process vmas isn't changing (we don't care about other vmas
> + * order). Each vma corresponds to an anon_vma_chain structure so
> + * there's no risk that other processes calling anon_vma_order_tail()
> + * and changing the same_anon_vma list under mremap() will screw with
> + * the relative order of this process vmas in the list, because we
> + * won't alter the order of any vma that isn't belonging to this
> + * process. And there can't be another anon_vma_order_tail running
> + * concurrently with mremap() coming from this process because we hold
> + * the mmap_sem for the whole mremap(). fork() ordering dependency
> + * also shouldn't be affected because we only care that the parent
> + * vmas are placed in the list before the child vmas and
> + * anon_vma_order_tail won't reorder vmas from either the fork parent
> + * or child.
> + */
> +void anon_vma_order_tail(struct vm_area_struct *dst)
> +{
> +       struct anon_vma_chain *pavc;
> +       struct anon_vma *root = NULL;
> +
> +       list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
> +               struct anon_vma *anon_vma = pavc->anon_vma;
> +               VM_BUG_ON(pavc->vma != dst);
> +               root = lock_anon_vma_root(root, anon_vma);
> +               list_del(&pavc->same_anon_vma);
> +               list_add_tail(&pavc->same_anon_vma, &anon_vma->head);
> +       }
> +       unlock_anon_vma_root(root);
> +}

I think Pawel might want to sign a "Tested-by", he may have been running this
patch safely for quite some days. :)

Reviewed-by: Nai Xia <nai.xia@xxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/