Re: [RFC PATCH 0/4] Restore change_pte optimization to its former glory

From: Andrea Arcangeli
Date: Fri Feb 01 2019 - 18:57:44 EST

Hello everyone,

On Thu, Jan 31, 2019 at 01:37:02PM -0500, Jerome Glisse wrote:
> From: Jérôme Glisse <jglisse@xxxxxxxxxx>
> This patchset is on top of my patchset to add context information to
> mmu notifier [1] you can find a branch with everything [2]. I have not
> tested it but i wanted to get the discussion started. I believe it is
> correct but i am not sure what kind of kvm test i can run to exercise
> this.
> The idea is that since kvm will invalidate the secondary MMUs within
> invalidate_range callback then the change_pte() optimization is lost.
> With this patchset everytime core mm is using set_pte_at_notify() and
> thus change_pte() get calls then we can ignore the invalidate_range
> callback altogether and only rely on change_pte callback.
> Note that this is only valid when either going from a read and write
> pte to a read only pte with same pfn, or from a read only pte to a
> read and write pte with different pfn. The other side of the story
> is that the primary mmu pte is clear with ptep_clear_flush_notify
> before the call to change_pte.

If it's cleared with ptep_clear_flush_notify, change_pte still won't
work. The above text needs updating with
"ptep_clear_flush". set_pte_at_notify is all about having
ptep_clear_flush only before it or it's the same as having a range
invalidate preceding it.

With regard to the code, wp_page_copy() needs
s/ptep_clear_flush_notify/ptep_clear_flush/ before set_pte_at_notify.

change_pte relies on the ptep_clear_flush preceding the
set_pte_at_notify that will make sure if the secondary MMU mapping
randomly disappears between ptep_clear_flush and set_pte_at_notify,
gup_fast will wait and block on the PT lock until after
set_pte_at_notify is completed before trying to re-establish a
secondary MMU mapping.

So then we've only to worry about what happens because we left the
secondary MMU mapping potentially intact despite we flushed the
primary MMU mapping with ptep_clear_flush (as opposed to
ptep_clear_flush_notify which would teardown the secondary MMU mapping

In you wording above at least the "with a different pfn" is
superflous. I think it's ok if the protection changes from read-only
to read-write and the pfn remains the same. Like when we takeover a
page because it's not shared anymore (fork child quit).

It's also ok to change pfn if the mapping is read-only and remains
read-only, this is what KSM does in replace_page.

The read-write to read-only case must not change pfn to avoid losing
coherency from the secondary MMU point of view. This isn't so much
about change_pte itself, but the fact that the page-copy generally
happens well before the pte mangling starts. This case never presents
itself in the code because KSM is first write protecting the page and
only later merging it, regardless of change_pte or not.

The important thing is that the secondary MMU must be updated first
(unlike the invalidates) to be sure the secondary MMU already points
to the new page when the pfn changes and the protection changes from
read-only to read-write (COW fault). The primary MMU cannot read/write
to the page anyway while we update the secondary MMU because we did
ptep_clear_flush() before calling set_pte_at_notify(). So this
ordering of "ptep_clear_flush; change_pte; set_pte_at" ensures
whenever the CPU can access the memory, the access is synchronous
with the secondary MMUs because they've all been updated already.

If (in set_pte_at_notify) we were to call change_pte() after
set_pte_at() what would happen is that the CPU could write to the page
through a TLB fill without page fault while the secondary MMUs still
read the old memory in the old readonly page.