Re: [PATCH] ksm: set anon_vma of first rmap_item of ksm page to page's anon_vma other than vma's anon_vma
From: Hugh Dickins
Date: Tue Jul 19 2016 - 20:42:38 EST
On Thu, 23 Jun 2016, zhouxianrong@xxxxxxxxxx wrote:
> From: z00281421 <z00281421@xxxxxxxxxxxxxxxxxxxx>
>
> set anon_vma of first rmap_item of ksm page to page's anon_vma
> other than vma's anon_vma so that we can lookup all the forked
s/other/rather/
> vma of kpage via reserve map. thus we can try_to_unmap ksm page
s/reserve/reverse/
> completely and reclaim or migrate the ksm page successfully and
> need not to merg other forked vma addresses of ksm page with
s/need not to merg/need not merge/
> building a rmap_item for it ever after.
>
> a forked more mapcount ksm page with partially merged vma addresses and
> a ksm page mapped into non-VM_MERGEABLE vma due to setting MADV_MERGEABLE
> on one of the forked vma can be unmapped completely by try_to_unmap.
Sorry, I found that very hard to understand; and I'm only now beginning
to understand you. And at last I realize that what you are *aiming for*
is a very good idea - thank you.
Whether it can be (quickly and simply) achieved, I'll have to give a lot
more thought to. But what is certain is that your current implementation
is broken (and I'm worried that your testing did not notice such errors).
Trying 4.7-rc7-mm1 with KSM under swapping load,
one machine gave me many
WARNING: CPU: N PID: NNNN at mm/rmap.c:412 .unlink_anon_vmas+0x144/0x1c8
that's VM_WARN_ON(anon_vma->degree);
and another machine gave me many
cache `anon_vma': double free detected
ending in slab_put_obj()'s kernel BUG at mm/slab.c:2647!
and another machine gave me
kernel BUG at mm/migrate.c:1019!
that's VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma, page);
No such problems once I reverted your patch. So, definitely NAK to
your present patch, and Andrew please remove it from the mmotm tree.
I have not tried to work out exactly why this and that error occur,
because there is a more fundamental problem with your patch: more on
that below.
>
> Signed-off-by: z00281421 <z00281421@xxxxxxxxxxxxxxxxxxxx>
As I said before, that should be Signed-off-by: Real Name <emailaddress>
> ---
> mm/ksm.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 4786b41..6bacc08 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -971,11 +971,13 @@ out:
> * @page: the PageAnon page that we want to replace with kpage
> * @kpage: the PageKsm page that we want to map instead of page,
> * or NULL the first time when we want to use page as kpage.
> + * @anon_vma: output the anon_vma of page used as kpage
> *
> * This function returns 0 if the pages were merged, -EFAULT otherwise.
> */
> static int try_to_merge_one_page(struct vm_area_struct *vma,
> - struct page *page, struct page *kpage)
> + struct page *page, struct page *kpage,
> + struct anon_vma **anon_vma)
Nit: is it necessary to change try_to_merge_one_page()? I thought it
was good enough just to take a snapshot of page_anon_vma(page) before
calling it from try_to_merge_with_ksm_page(), then use that there.
It's true that doing it in here you capture anon_vma while the page
is locked, but I don't think that actually matters (page_move_anon_rmap
might race and change it, but it's a benign race).
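For illustration, an untested sketch of that simpler variant in
try_to_merge_with_ksm_page() (keeping your semantics of using the
page's anon_vma only on the first, kpage == NULL, merge, but leaving
try_to_merge_one_page()'s signature alone):

	struct anon_vma *anon_vma;
	...
	/*
	 * Snapshot before merging: a successful first merge makes the
	 * page PageKsm, after which page_anon_vma() returns NULL.
	 */
	anon_vma = kpage ? NULL : page_anon_vma(page);

	err = try_to_merge_one_page(vma, page, kpage);
	if (err)
		goto out;

	remove_rmap_item_from_tree(rmap_item);

	/* Must get reference to anon_vma while still holding mmap_sem */
	rmap_item->anon_vma = anon_vma ? anon_vma : vma->anon_vma;
	get_anon_vma(rmap_item->anon_vma);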
> {
> pte_t orig_pte = __pte(0);
> int err = -EFAULT;
> @@ -1015,6 +1017,8 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
> * PageAnon+anon_vma to PageKsm+NULL stable_node:
> * stable_tree_insert() will update stable_node.
> */
> + if (anon_vma != NULL)
> + *anon_vma = page_anon_vma(page);
> set_page_stable_node(page, NULL);
> mark_page_accessed(page);
> /*
> @@ -1055,6 +1059,7 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
> {
> struct mm_struct *mm = rmap_item->mm;
> struct vm_area_struct *vma;
> + struct anon_vma *anon_vma = NULL;
> int err = -EFAULT;
>
> down_read(&mm->mmap_sem);
> @@ -1062,7 +1067,7 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
> if (!vma)
> goto out;
>
> - err = try_to_merge_one_page(vma, page, kpage);
> + err = try_to_merge_one_page(vma, page, kpage, &anon_vma);
> if (err)
> goto out;
>
> @@ -1070,7 +1075,10 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
> remove_rmap_item_from_tree(rmap_item);
>
> /* Must get reference to anon_vma while still holding mmap_sem */
> - rmap_item->anon_vma = vma->anon_vma;
> + if (anon_vma != NULL)
> + rmap_item->anon_vma = anon_vma;
> + else
> + rmap_item->anon_vma = vma->anon_vma;
I had something like that in the patch I was trying (but never
completed or posted) last year. But it did not actually do much good,
not in my testing anyway. A much more significant effect was that an
rmap_item, needed to locate the page for rmap later, could be removed
from the stable tree when its mm exited, leaving nothing behind to
locate those ptes which were forked from this one but not yet scanned
by ksmd. So most of my incomplete patch was trying to deal with that.
A link to that thread is marc.info/?l=linux-mm&m=142907899327574&w=2
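For reference, the stable-tree side of remove_rmap_item_from_tree(),
heavily abridged: once the exiting mm's rmap_item is unlinked, its
anon_vma reference goes with it, and no other rmap_item on the
stable_node's hlist can reach the ptes forked from this one.

	if (rmap_item->address & STABLE_FLAG) {
		...
		hlist_del(&rmap_item->hlist);	/* gone from stable node */
		...
		put_anon_vma(rmap_item->anon_vma); /* and its lookup path */
		rmap_item->address &= PAGE_MASK;
	}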
> get_anon_vma(vma->anon_vma);
> out:
> up_read(&mm->mmap_sem);
> @@ -1435,6 +1443,11 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
>
> remove_rmap_item_from_tree(rmap_item);
>
> + if (kpage == page) {
> + put_page(kpage);
> + return;
> + }
> +
And there I think we come to what's so good and what's so bad with your
patch. You are trying to avoid doing unnecessary work on the forked
kpage: good. But since you don't attach its rmap_item to the stable
tree, you're making the effect I mentioned in my previous paragraph even
worse: you are entirely dependent on the original rmap_item, which might
get removed from the stable tree when its mm exits, leaving the forked
ptes impossible to locate thereafter - bad.
Avoiding repeated unnecessary work on the stable tree is good, not just
to save time when building the tree up, but also whenever searching it
thereafter. I had not fully realized this until very recently, but I'm
now afraid that the KSM rmap lookup (as it stands) can get frighteningly
inefficient with forked pages.
Because each rmap_item attached to the stable tree represents, not so
much the position of a pte, as the specification of an anon_vma lookup
for the pte. And the way it works on a forked page at present, it is
duplicating those lookups unnecessarily. Not so bad for try_to_unmap()
(when the search terminates once all ptes have been unmapped), but
terrible for page_referenced(), which advances to the search_new_forks
stage and does a full anon_vma lookup on every rmap_item for the page.
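To see why, here is rmap_walk_ksm(), abridged: each rmap_item on the
stable_node's hlist drives a full interval-tree walk over its own
anon_vma, and the search_new_forks pass then repeats every one of
those walks from the top.

	stable_node = page_stable_node(page);
	if (!stable_node)
		return ret;
again:
	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
		struct anon_vma *anon_vma = rmap_item->anon_vma;
		struct anon_vma_chain *vmac;
		struct vm_area_struct *vma;

		anon_vma_lock_read(anon_vma);
		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
					       0, ULONG_MAX) {
			vma = vmac->vma;
			if (rmap_item->address < vma->vm_start ||
			    rmap_item->address >= vma->vm_end)
				continue;
			/*
			 * First pass: only the vma of the rmap_item's
			 * own mm; second pass: covering vmas of other
			 * mms, forked since ksmd scanned.
			 */
			if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
				continue;
			ret = rwc->rmap_one(page, vma,
					rmap_item->address, rwc->arg);
			...
		}
		anon_vma_unlock_read(anon_vma);
	}
	if (!search_new_forks++)
		goto again;

And since the anon_vmas named by a forked page's rmap_items overlap
(the parent's interval tree already covers the children's vmas), much
of that walking is repeated work.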
I wasn't thinking about that at all when I tried to put together last
year's patch: all I was thinking about was maximizing the ability of
the lookup to locate ptes (and checking how successful it is at that);
but now the inefficient lookup seems like a priority to think about.
Am I right to assume that efficient handling of forked pages is a
serious concern for your product, rather than just an interesting
academic exercise for your own amusement?
Hugh