Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

From: Andrea Arcangeli
Date: Wed Aug 30 2017 - 19:01:32 EST


On Wed, Aug 30, 2017 at 02:53:38PM -0700, Linus Torvalds wrote:
> On Wed, Aug 30, 2017 at 9:52 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> >
> > I pointed out in earlier email ->invalidate_range can only be
> > implemented (as mutually exclusive alternative to
> > ->invalidate_range_start/end) by secondary MMUs that share the very
> > same pagetables with the core linux VM of the primary MMU, and those
> > invalidate_range are already called by
> > __mmu_notifier_invalidate_range_end.
>
> I have to admit that I didn't notice that fact - that we are already
> in the situation that invalidate_range is called by the range_end()
> notifier.
>
> I agree that that should simplify all the code, and means that we
> don't have to worry about the few cases that already implemented only
> the "invalidate_page()" and "invalidate_range()" cases.
>
> So I think that simplifies Jérôme's patch further - once you have put
> the range_start/end() cases around the inner loop, you can just drop
> the invalidate_page() things entirely.
>
> > So this conversion from invalidate_page to invalidate_range looks
> > superfluous and the final mmu_notifier_invalidate_range_end should be
> > enough.
>
> Yes. I missed the fact that we already called range() from range_end().
>
> That said, the double call shouldn't hurt correctness, and it's
> "closer" to old behavior for those people who only did the range/page
> ones, so I wonder if we can keep Jérôme's patch in its current state
> for 4.13.

Yes, the double call doesn't hurt correctness. Keeping it in its
current state is safer if anything, so I have no objection to it,
other than I'd like to optimize it further when possible; that can be
done later.

In fact we're already running the double call in various fast paths
too, and the rmap walk isn't the hottest path doing such a double
call, so it's not a major concern.

Also not a bug, but one further (and more obviously safe) enhancement
I would like is to restrict those rmap invalidation ranges to
PAGE_SIZE << compound_order(page) instead of PMD_SIZE/PMD_MASK.

+ /*
+ * We have to assume the worst case, i.e. pmd, for invalidation. Note
+ * that the page cannot be freed in this function, as the caller of
+ * try_to_unmap() must hold a reference on the page.
+ */
+ end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);

We don't need to invalidate 2MB of secondary MMU mappings surrounding
a 4KB page just to swap out that one 4KB page. split_huge_page() can't
run while the rmap locks are held, so compound_order(page) is stable
and safe to use there.
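Something like the below is what I mean, as an untested sketch against
the quoted hunk (same start/end variables as above):

+ /*
+ * Only invalidate the range actually covered by the page:
+ * PAGE_SIZE << compound_order(page) bytes, clamped to the vma.
+ * compound_order() is stable here because split_huge_page()
+ * can't run under the rmap locks.
+ */
+ end = min(vma->vm_end,
+           start + (PAGE_SIZE << compound_order(page)));
+ mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);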

It can also be optimized incrementally later.

> Because I still want to release 4.13 this weekend, despite this
> upheaval. Otherwise I'll have timing problems during the next merge
> window.
>
> Andrea, do you otherwise agree with the whole series as is?

I only wish we had more time to test Jérôme's patchset, but I
certainly agree in principle and I don't see regressions in it.

The callouts to ->invalidate_page seem to have diminished over time
(for the various reasons we know), so if we don't use it in the fast
paths, using it only in the rmap walk slow paths probably wasn't
providing much performance benefit.

Thanks,
Andrea