Re: [BUG] vfio device assignment regression with THP ref counting redesign

From: Alex Williamson
Date: Thu Apr 28 2016 - 22:45:53 EST


On Fri, 29 Apr 2016 03:51:06 +0300
"Kirill A. Shutemov" <kirill@xxxxxxxxxxxxx> wrote:

> On Fri, Apr 29, 2016 at 01:21:27AM +0200, Andrea Arcangeli wrote:
> > Hello Alex and Kirill,
> >
> > On Thu, Apr 28, 2016 at 12:58:08PM -0600, Alex Williamson wrote:
> > > > > specific fix to this code is not applicable. It also still occurs on
> > > > > kernels as recent as v4.6-rc5, so the issue hasn't been silently fixed
> > > > > yet. I'm able to reproduce this fairly quickly with the above test,
> > > > > but it's not hard to imagine a test w/o any iommu dependencies which
> > > > > simply does a user directed get_user_pages_fast() on a set of userspace
> > > > > addresses, retains the reference, and at some point later rechecks that
> > > > > a new get_user_pages_fast() results in the same page address. It
> >
> > Can you try to "git revert 1f25fe20a76af0d960172fb104d4b13697cafa84"
> > and then apply the below patch on top of the revert?
> >
> > Totally untested... if I missed something and it isn't correct, I hope
> > this brings us in the right direction faster at least.
> >
> > Overall the problem I think is that we need to restore full accuracy
> > and we can't deal with false positive COWs (which aren't entirely
> > cheap either... reading 512 cachelines should be much faster than
> > copying 2MB and using 4MB of CPU cache). 32k vs 4MB. The problem of
> > course is when we really need a COW, we'll waste an additional 32k,
> > but then it doesn't matter that much as we'd be forced to load 4MB of
> > cache anyway in such case. There's room for optimizations but even the
> > simple below patch would be ok for now.
> >
> > From 09e3d1ff10b49fb9c3ab77f0b96a862848e30067 Mon Sep 17 00:00:00 2001
> > From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> > Date: Fri, 29 Apr 2016 01:05:06 +0200
> > Subject: [PATCH 1/1] mm: thp: calculate page_mapcount() correctly for THP
> > pages
> >
> > This allows to revert commit 1f25fe20a76af0d960172fb104d4b13697cafa84
> > and it provides fully accuracy with wrprotect faults so page pinning
> > will stop causing false positive copy-on-writes.
> >
> > Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> > ---
> > mm/util.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/util.c b/mm/util.c
> > index 6cc81e7..a0b9f63 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -383,9 +383,10 @@ struct address_space *page_mapping(struct page *page)
> > /* Slow path of page_mapcount() for compound pages */
> > int __page_mapcount(struct page *page)
> > {
> > - int ret;
> > + int ret = 0, i;
> >
> > - ret = atomic_read(&page->_mapcount) + 1;
> > + for (i = 0; i < HPAGE_PMD_NR; i++)
> > + ret = max(ret, atomic_read(&page->_mapcount) + 1);
> > page = compound_head(page);
> > ret += atomic_read(compound_mapcount_ptr(page)) + 1;
> > if (PageDoubleMap(page))
>
> You are right about the cause. I spend some time on wrong path: I was only
> able to trigger the bug with numa balancing enabled, so I assumed
> something is wrong in that code...
>
> I would like to preserve current page_mapcount() behaviouts.
> I think this fix is better:

This also seems to work in my testing, but assuming all else being
equal, there is a performance difference between the two for this test
case in favor of Andrea's solution. Modifying the test to exit after
the first set of iterations, my system takes on average 107s to complete
with the solution below or 103.5s with the other approach. Please note
that I have every mm debugging option I could find enabled and THP
scanning full speed on the system, so I don't know how this would play
out in a more tuned configuration.

The only reason I noticed is that I added a side test to sleep a random
number of seconds and kill the test program because sometimes killing
the test triggers errors. I didn't see any errors with either of these
solutions, but suspected the first solution was completing more
iterations for similar intervals. Modifying the test to exit seems to
prove that true.

I can't speak to which is the more architecturally correct solution,
but there may be a measurable performance difference to consider.
Thanks,

Alex

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 86f9f8b82f8e..163c10f48e1b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1298,15 +1298,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> /*
> * We can only reuse the page if nobody else maps the huge page or it's
> - * part. We can do it by checking page_mapcount() on each sub-page, but
> - * it's expensive.
> - * The cheaper way is to check page_count() to be equal 1: every
> - * mapcount takes page reference reference, so this way we can
> - * guarantee, that the PMD is the only mapping.
> - * This can give false negative if somebody pinned the page, but that's
> - * fine.
> + * part.
> */
> - if (page_mapcount(page) == 1 && page_count(page) == 1) {
> + if (total_mapcount(page) == 1) {
> pmd_t entry;
> entry = pmd_mkyoung(orig_pmd);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);