Re: [PATCH] mm/mmu_notifier: restore set_pte_at_notify semantics

From: Jerome Glisse
Date: Wed Apr 02 2014 - 11:18:45 EST


On Wed, Apr 02, 2014 at 03:52:45PM +0300, Haggai Eran wrote:
> On 03/30/2014 11:33 PM, Jerome Glisse wrote:
> >On Wed, Jan 22, 2014 at 04:01:15PM +0200, Haggai Eran wrote:
> >>I'm worried about the following scenario:
> >>
> >>Given a read-only page, suppose one host thread (thread 1) writes to
> >>that page, and performs COW, but before it calls the
> >>mmu_notifier_invalidate_page_if_missing_change_pte function another host
> >>thread (thread 2) writes to the same page (this time without a page
> >>fault). Then we have a valid entry in the secondary page table to a
> >>stale page, and someone (thread 3) may read stale data from there.
> >>
> >>Here's a diagram that shows this scenario:
> >>
> >>Thread 1 | Thread 2 | Thread 3
> >>========================================================================
> >>do_wp_page(page 1) | |
> >> ... | |
> >> set_pte_at_notify | |
> >> ... | write to page 1 |
> >> | | stale access
> >> pte_unmap_unlock | |
> >> invalidate_page_if_missing_change_pte | |
> >>
> >>This is currently prevented by the use of the range start and range end
> >>notifiers.
> >>
> >>Do you agree that this scenario is possible with the new patch, or am I
> >>missing something?
> >>
> >I believe you are right, but of all the upstream user of the mmu_notifier
> >API only xen would suffer from this ie any user that do not have a proper
> >change_pte callback can see the bogus scenario you describe above.
> Yes. I sent our RDMA paging RFC patch-set on linux-rdma [1] last
> month, and it would also suffer from this scenario, but it's not
> upstream yet.
> >The issue i see is with user that want to/or might sleep when they are
> >invalidation the secondary page table. The issue being that change_pte is
> >call with the cpu page table locked (well at least for the affected pmd).
> >
> >I would rather keep the invalidate_range_start/end bracket around change_pte
> >and invalidate page. I think we can fix the kvm regression by other means.
> Perhaps another possibility would be to do the
> invalidate_range_start/end bracket only when the mmu_notifier is
> missing a change_pte implementation.

This would imply either to scan all mmu_notifier currently register or to
have a global flags for the mm to know if there is one mmu_notifier without
change_pte. Moreover this would means that kvm would remain "broken" if one
of the mmu notifier do not have the change_pte callback.

Solution i have in mind and is part of a patchset i am working on, just
involve passing along an enum value to mmu notifier callback. The enum
value would tell what are the exact event that actually triggered the
mmu notifier call (vmscan, migrate, ksm, ...). Knowing this kvm could then
simply ignore invalidate_range_start/end for event it knows it will get
a change_pte callback.

Cheers,
Jérôme

>
> Best regards,
> Haggai
>
> [1] http://www.spinics.net/lists/linux-rdma/msg18906.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/