Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages

From: Jerome Glisse
Date: Wed Aug 29 2018 - 14:14:31 EST


On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
> On 08/27/2018 06:46 AM, Jerome Glisse wrote:
> > On Mon, Aug 27, 2018 at 09:46:45AM +0200, Michal Hocko wrote:
> >> On Fri 24-08-18 11:08:24, Mike Kravetz wrote:
> >>> Here is an updated patch which does as you suggest above.
> >> [...]
> >>> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>> subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> >>> address = pvmw.address;
> >>>
> >>> + if (PageHuge(page)) {
> >>> + if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> >>> + /*
> >>> + * huge_pmd_unshare unmapped an entire PMD
> >>> + * page. There is no way of knowing exactly
> >>> + * which PMDs may be cached for this mm, so
> >>> + * we must flush them all. start/end were
> >>> + * already adjusted above to cover this range.
> >>> + */
> >>> + flush_cache_range(vma, start, end);
> >>> + flush_tlb_range(vma, start, end);
> >>> + mmu_notifier_invalidate_range(mm, start, end);
> >>> +
> >>> + /*
> >>> + * The ref count of the PMD page was dropped
> >>> + * which is part of the way map counting
> >>> + * is done for shared PMDs. Return 'true'
> >>> + * here. When there is no other sharing,
> >>> + * huge_pmd_unshare returns false and we will
> >>> + * unmap the actual page and drop map count
> >>> + * to zero.
> >>> + */
> >>> + page_vma_mapped_walk_done(&pvmw);
> >>> + break;
> >>> + }
> >>
> >> This still calls into the notifier while holding the ptl lock. Either I am
> >> missing something or the invalidation is broken in this loop (if not also
> >> for other invalidations).
> >
> > mmu_notifier_invalidate_range() is done with the pt lock held; only the
> > start and end versions need to happen outside the pt lock.
>
> Hi Jérôme (and anyone else having good understanding of mmu notifier API),
>
> Michal and I have been looking at backports to stable releases. If you look
> at the v4.4 version of try_to_unmap_one(), it does not use the
> mmu_notifier_invalidate_range_start/end interfaces. Rather, it uses the
> mmu_notifier_invalidate_page(), passing in the address of the page it
> unmapped. This is done after releasing the ptl lock. I'm not even sure if
> this works for huge pages, as it appears some THP supporting code was added
> to try_to_unmap_one() after v4.4.
>
> But, we were wondering what mmu notifier interface to use in the case where
> try_to_unmap_one() unmaps a shared pmd huge page as addressed in the patch
> above. In this case, a PUD sized area is effectively unmapped. In the
> code/patch above we have the invalidate range (start and end as well) take
> the PUD sized area into account.
>
> What would be the best mmu notifier interface to use where there are no
> start/end calls?
> Or, is the best solution to add the start/end calls as is done in later
> versions of the code? If that is the suggestion, has there been any change
> in invalidate start/end semantics that we should take into account?

start/end would be the ones to add; 4.4 seems broken with respect to THP
and mmu notification. Another solution is to fix the users of mmu notifiers,
as there were only a handful of them back then. For instance, properly
adjusting the address to match the first address covered by the pmd or pud,
and passing the correct page size down to mmu_notifier_invalidate_page(),
would allow this to be fixed easily.
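
In try_to_unmap_one() that would look something like this (untested sketch;
note the size argument does not exist in the v4.4
mmu_notifier_invalidate_page(), it is the extension you would have to add
and plumb through the handful of ->invalidate_page() callbacks):

	/*
	 * After a successful huge_pmd_unshare() the range that went away
	 * is the PUD-sized, PUD-aligned range covered by the shared PMD
	 * page, so report that instead of the single faulting address.
	 */
	unsigned long haddr = address & PUD_MASK;

	/* v4.4 call, address only */
	mmu_notifier_invalidate_page(mm, haddr);
	/*
	 * with the hypothetical size extension:
	 * mmu_notifier_invalidate_page(mm, haddr, PUD_SIZE);
	 */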

This is OK because try_to_unmap_one() replaces the pte/pmd/pud with an
invalid entry (either poison, migration or swap) inside the function. So
anyone racing will synchronize on those special entries, which is why it is
fine to delay mmu_notifier_invalidate_page() until after dropping the page
table lock.
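
To be clear on the ordering, the migration case in try_to_unmap_one() boils
down to something like this (heavily simplified sketch, from memory, so
double check against the actual 4.4 code):

	pte = page_check_address(page, mm, address, &ptl, 0);
	pteval = ptep_clear_flush(vma, address, pte);	/* hw pte cleared, TLB flushed */
	set_pte_at(mm, address, pte,
		   swp_entry_to_pte(make_migration_entry(page, pte_write(pteval))));
	pte_unmap_unlock(pte, ptl);

	/* safe here: any racing fault already sees the migration entry */
	mmu_notifier_invalidate_page(mm, address);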

Adding start/end might be the solution with the least code churn, as you
would only need to change try_to_unmap_one().
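
Something along these lines (untested sketch, assuming v4.4's
try_to_unmap_one() signature and that adjust_range_if_pmd_sharing_possible()
from your series gets backported as well):

static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
			    unsigned long address, void *arg)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long start = address;
	unsigned long end = address + PAGE_SIZE;
	int ret = SWAP_AGAIN;

	if (PageHuge(page)) {
		end = start + huge_page_size(page_hstate(page));
		/* widen to the PUD range a shared PMD may cover */
		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
	}

	mmu_notifier_invalidate_range_start(mm, start, end);

	/*
	 * ... existing unmap logic; after a successful huge_pmd_unshare()
	 * call mmu_notifier_invalidate_range(mm, start, end) while still
	 * holding the ptl, as in the patch above, and drop the old
	 * per-page mmu_notifier_invalidate_page() call ...
	 */

	mmu_notifier_invalidate_range_end(mm, start, end);

	return ret;
}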

Cheers,
Jérôme