Allocating all these page tables to install uffd-wp flags is also one of the
things I actually dislike about the new approach just to get more precision.
My take is that this is unavoidable if we need the accuracy. More below.
I wondered if it could be avoided, but my brain started to hurt. Just an
idea how to eventually avoid it:
We can catch accesses to virtual memory that is not yet populated using
UFFD_MISSING mode. When installing a zeropage, we could set the uffd-wp bit.
Good point. :)
But we don't want to mix in the missing mode I guess. But maybe we could use
a similar approach for the uffd-wp async mode? Something like the following.
We'd want another mode(s?) for that, in addition to _ASYNC mode:
(a) When we hit an unpopulated PTE using read-access, we map a fresh page
(e.g., zeropage) and set the uffd-wp bit. This will make sure that the next
write access triggers uffd-wp.
(b) When we hit an unpopulated PTE using write-access, we only map a fresh
page (not setting the bit). We would want to trigger uffd-wp in !_ASYNC mode
Not setting the uffd-wp bit sounds dangerous here. What if, right after the
pgtable pte gets set up, another thread writes to it? I think that's
data loss.
after that. In _ASYNC mode, all is good.
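To make (a) and (b) concrete, here is a minimal user-space sketch in C. It
is not kernel code: toy_pte, toy_access and toy_fault are hypothetical
names invented for illustration, and the model is single-threaded, so it
says nothing about a concurrent writer racing with the fault.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical types, not kernel structures. */
enum toy_access { TOY_READ, TOY_WRITE };

struct toy_pte {
    bool present;  /* something is mapped */
    bool uffd_wp;  /* write-protect bit is armed */
};

/*
 * Returns true when the access is a write that the tracker has to see
 * (a notification in !_ASYNC mode, a recorded write in _ASYNC mode).
 */
static bool toy_fault(struct toy_pte *pte, enum toy_access access)
{
    if (!pte->present) {
        pte->present = true;
        /*
         * (a) read:  map e.g. the zeropage and arm the bit.
         * (b) write: map a fresh page and leave the bit clear.
         */
        pte->uffd_wp = (access == TOY_READ);
        return access == TOY_WRITE;
    }
    if (access == TOY_WRITE && pte->uffd_wp) {
        pte->uffd_wp = false;  /* fault resolved, later writes are silent */
        return true;
    }
    return false;
}

/* Walk through (a) followed by a write, then (b) on a second PTE. */
static void toy_selfcheck(void)
{
    struct toy_pte a = { false, false };
    struct toy_pte b = { false, false };

    assert(!toy_fault(&a, TOY_READ) && a.uffd_wp);   /* (a) arms the bit */
    assert(toy_fault(&a, TOY_WRITE) && !a.uffd_wp);  /* next write is caught */
    assert(toy_fault(&b, TOY_WRITE) && !b.uffd_wp);  /* (b) */
}
```

Calling toy_selfcheck() walks both cases; the point is only that no page
table has to exist before the first fault in this model.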
IIUC you're suggesting to have a new vma flag (or VM_UFFD_WP + some other
feature bit, which is fundamentally similar to a new vma flag) to show that
"when registering uffd-wp on this region, protection starts right away". Then
it's not pte based, and we don't have the problem with pgtable population
either.
True, but it goes back to why we need pte markers. They give the accuracy,
along with the trade-off of using the pgtables.
Without pte markers and uffd-wp bits everywhere, how do we tell "this pte
is none" or "even if this pte is none, it has been written before but just
got zapped, so we don't need to notify again"?
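A small user-space sketch of that question, in C. It is a toy model under
stated assumptions, not kernel code: toy_slot and the toy_* helpers are
hypothetical names, and wp_marker stands in for per-pte state (a pte marker
or uffd-wp bit) that survives a zap.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one page slot in a protected range. */
struct toy_slot {
    bool present;    /* pte is populated */
    bool wp_marker;  /* per-pte state that survives a zap */
};

/* A tracked write populates the slot and drops the marker. */
static void toy_write(struct toy_slot *s)
{
    s->present = true;
    s->wp_marker = false;
}

/* Zapping empties the pte but keeps whatever the marker said. */
static void toy_zap(struct toy_slot *s)
{
    s->present = false;
}

/* Per-vma attribute only: every none pte looks "still protected". */
static bool toy_notify_per_vma(const struct toy_slot *s, bool vma_protected)
{
    return vma_protected && !s->present;
}

/* Per-pte state: notify only if the marker is still set. */
static bool toy_notify_per_pte(const struct toy_slot *s)
{
    return s->wp_marker;
}

/* Written once, then zapped: only the per-pte scheme remembers that. */
static void toy_zap_selfcheck(void)
{
    struct toy_slot s = { false, true };  /* none pte, protected */

    toy_write(&s);
    toy_zap(&s);
    assert(toy_notify_per_vma(&s, true));  /* false positive */
    assert(!toy_notify_per_pte(&s));       /* already reported once */
}
```

In this model the per-vma check reports the slot again after the zap,
while the per-pte check does not.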
Fair enough, I won't interfere. The natural way for me to tackle this would
be to try fixing soft-dirty instead, or handle the details on how soft-dirty
is implemented internally: not exposing to user space that we are using
uffd-wp under the hood, for example.
Maybe that would be a reasonable approach? Handle this all internally if
possible, and remove the old soft-dirty infrastructure once it's working.
We wouldn't be able to use uffd-wp + softdirty, but who really cares I guess
...
The thing is userfaultfd is an exposed and formal kernel interface to
userspace already, whether or not this new async mode lands. IMHO it's
necessary in this case to let the user know what's happening inside rather
than deciding this is not important and making the decision for the user. We
don't want to surprise anyone I guess..
It's not only about the case where a user may be using userfault in the
tracee app, so that the user knows why the "new soft-dirty" won't work.
It's also about staying compatible with soft-dirty even if we want to
replace it some day with uffd-wp - it means there'll at least be a period
where both of them exist, until we know they're solidly interchangeable.
So far it's definitely not at that stage.. and they're not alike - it's
just that some of us wanted to have soft-dirty change into something like
uffd-wp, then since the 1st way is not easily achievable, we can try the
other way round.
Right. And uffd-wp even supports hugetlb :)
While the other "uffd cannot be nested" defect actually applies equally to
soft-dirty (there's no way to let a tracee clear_refs itself, or it'll
also become a mess), we can still use soft-dirty to track
a uffd application.
I wonder if we really care about that. Would be good to know if there are
any relevant softdirty users still around ... from what I understood, even
CRIU wants to handle it using uffd-wp.
Yeah I don't know either.
Jup.
What does this mean?
Yes to the statement "So I assume there's no major issue to not continue
with a new version, then I'll move on." :)
But my idea at the very beginning might make sense to consider: can we
instead handle this at fault time and avoid allocating all these page
tables? Happy to hear if I am missing something important.
I've raised my questions above. I have a feeling that you're thinking mostly
of anonymous memory; shmem is even trickier IIUC, because ptes can
easily get zapped, and if we only rely on a per-vma attribute, there'll be
tons of false positives.