Re: [PATCH 1/5] mm: Introduce mm_struct.has_pinned
From: Linus Torvalds
Date: Mon Sep 28 2020 - 13:54:50 EST
On Mon, Sep 28, 2020 at 10:23 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> Yes... Actually I am also thinking about the complete solution to cover
> read-only fast-gups too, but now I start to doubt this, at least for the fork()
> path. E.g. if we'd finally like to use pte_protnone() to replace the current
> pte_wrprotect(), we'll be able to also block the read gups, but we'll suffer
> the same degradation on normal fork()s, or even more. Seems unacceptable.
So I think the real question about pinned read gups is what semantics
they should have.
Because honestly, I think we have two options:
- the current "it gets a shared copy from the page tables"
- the "this is an exclusive pin, and it _will_ follow the source VM
changes, and never break"
because honestly, if we get a shared copy at the time of the pinning
(like we do now), then "fork()" is entirely immaterial. The fork() can
have happened ages ago, that page is shared with other processes, and
anybody process writing to it - including very much the pinning one -
will cause a copy-on-write and get a copy of the page.
IOW, the current - and past - semantics for read pinning is that you
get a copy of the page, but any changes made by the pinning process
may OR MAY NOT show up in your pinned copy.
Again: doing a concurrent fork() is entirely immaterial, because the
page can have been made a read-only COW page by _previous_ fork()
calls (or KSM logic or whatever).
In other words: read pinning gets a page efficiently, but there is
zero guarantee of any future coherence with the process doing
subsequent writes.
That has always been the semantics, and FOLL_PIN didn't change that at
all. You may have had things that worked almost by accident (ie you
had made the page private by writing to it after the fork, so the read
pinning _effectively_ gave you a page that was coherent), but even
that was always accidental rather than anything else. Afaik it could
easily be broken by KSM, for example.
In other words, a read pin isn't really any different from a read GUP.
You get a reference to a page that is valid at the time of the page
lookup, and absolutely nothing more.
Now, the alternative is to make a read pin have the same guarantees as
a write pin, and say "this will stay attached to this MM until unmap
or unpin".
But honestly, that is largely going to _be_ the same as a write pin,
because it absolutely needs to do a page COW at the time of the
pinning to get that initial exclusive guarantee in the first place.
Without that initial exclusivity, you cannot avoid future COW events
breaking the wrong way.
So I think the "you get a reference to the page at the time of the
pin, and the page _may_ or may not change under you if the original
process writes to it" are really the only relevant semantics. Because
if you need those exclusive semantics, you might as well just use a
write pin.
The downside of a write pin is that it not only makes that page
exclusive, it also (a) marks it dirty and (b) requires write access.
That can matter particularly for shared mappings. So if you know
you're doing the pin on a shared mmap, then a read pin is the right
thing, because the page will stay around - not because of the VM it
happens in, but because of the underlying file mapping!
See the difference?
> The other question is, whether we should emphasize and document somewhere that
> MADV_DONTFORK is still (and should always be) the preferred way, because
> changes like this series can potentially encourage the other way.
I really suspect that the concurrent fork() case is fundamentally hard
to handle.
Is it impossible? No. Even without any real locking, we could change
the code to do a seqcount_t, for example. The fastgup code wouldn't
take a lock, but it would just fail and fall back to the slow code if
the sequence count fails.
So the copy_page_range() code would do a write count around the copy:
write_seqcount_begin(&mm->seq);
.. do the copy ..
write_seqcount_end(&mm->seq);
and the fast-gup code would do a
seq = raw_read_seqcount(&mm->seq);
if (seq & 1)
return -EAGAIN;
at the top, and do a
if (__read_seqcount_t_retry(&mm->seq, seq) {
.. Uhhuh, that failed, drop the ref to the page again ..
return -EAGAIN;
}
after getting the pin reference.
We could make this conditional on FOLL_PIN, or maybe even a new flag
("FOLL_FORK_CONSISTENT").
So I think we can serialize with fork() without serializing each and every PTE.
If we want to and really need to.
Hmm?
Linus