On 04.08.21 21:17, Peter Xu wrote:
On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
any state information regarding shared memory in per-process page tables: it
just doesn't make too much sense.
And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
belongs to the shared file ("did *someone* write to this page", "is
*someone* interested in changes to that page", "is there something"). I
know, that screams for a completely different design with respect to these
features.
I guess we start learning the hard way that shared memory is just different
and requires different interfaces than the per-process page table interfaces we
have (pagemap, userfaultfd).
I didn't have time to explore any alternatives yet, but I wonder if tracking
such stuff per an actual fd/memfd and not via process page tables is
actually the right and clean approach. There are certainly many issues to
solve, but conceptually to me it feels more natural to have these shared
memory features not mangled into process page tables.
Yes, we can explore all the possibilities, I'm totally fine with it.
I just want to say that I still don't think having a page cache means we must
put all the page-relevant things into the page cache.
[sorry for the late reply]
Right, but for the case of shared, swapped out pages, the information is
already there, in the page cache :)
They're shared by processes, but a process can still have its own way to
describe the relationship to that page in the cache. To me it's as simple as
"we allow process A to write to page cache P" while "we don't allow process B
to write to the same page", like the write bit.
The issue I'm having with uffd-wp as it was proposed for shared memory is
that there is hardly a sane use case where we would *want* it to work
that way.
A UFFD-WP flag in a page table for shared memory means "please notify
once this process modifies the shared memory (via page tables, not via
any other fd modification)". Do we have an example application where
these semantics make sense and don't over-complicate the whole
approach? I don't know any, thus I'm asking dumb questions :)
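To make the semantics I'm describing concrete, here is a toy model in Python. It is not kernel code and does not use the real userfaultfd API; `SharedFile`, `Process`, and all method names are made up for illustration. It only shows that a marker kept in one process's page table catches writes made through that process's mapping, and nothing else.

```python
# Toy model only: every class and method name here is hypothetical.

class SharedFile:
    """Stands in for the page cache of a shmem/memfd file."""

    def __init__(self, npages):
        self.pages = [0] * npages

    def write_via_fd(self, idx, value):
        # Models a write through the fd: no page table is consulted.
        self.pages[idx] = value


class Process:
    """A process that has the shared file mapped."""

    def __init__(self, name, shared):
        self.name = name
        self.shared = shared
        self.wp = set()      # pages carrying the wp marker in THIS page table
        self.events = []     # write-protect notifications this process got

    def write_protect(self, idx):
        self.wp.add(idx)

    def write_via_mapping(self, idx, value):
        if idx in self.wp:
            self.events.append(idx)  # notification is raised
            self.wp.discard(idx)     # handler un-protects, write proceeds
        self.shared.pages[idx] = value


shared = SharedFile(npages=4)
a = Process("A", shared)
b = Process("B", shared)
a.write_protect(0)

b.write_via_mapping(0, 1)   # B's page table has no marker: no event
shared.write_via_fd(0, 2)   # fd write bypasses page tables: no event
silent_writes_seen_by_a = list(a.events)

a.write_via_mapping(0, 3)   # only A's own mapped write notifies
```

In this model the page changes twice before A hears anything, which is why I'm asking which application would actually want that behaviour.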
For background snapshots in QEMU the flow would currently be like this,
assuming all processes have the shared guest memory mapped.
1. Background snapshot preparation: QEMU requests all processes
to uffd-wp the range
a) All processes register a uffd handler on guest RAM
b) All processes fault in all guest memory (essentially populating all
memory): with a uffd-WP extension we might be able to get rid of
that, I remember you were working on that.
c) All processes uffd-WP the range to set the bit in their page table
2. Background snapshot runs:
a) A process either receives a UFFD-WP event and forwards it to QEMU or
QEMU polls all other processes for UFFD events.
b) QEMU writes the to-be-changed page to the migration stream.
c) QEMU triggers all processes to un-protect the page and wake up any
waiters. All processes clear the uffd-WP bit in their page tables.
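The flow above can be sketched as a small simulation to show how much coordination it needs. This is a rough model under my own assumptions, not QEMU or kernel code: `Peer`, `run_snapshot`, and the request counter are hypothetical, and each "request" stands for one round trip between QEMU and a process.

```python
# Rough simulation of steps 1a-1c and 2a-2c; all names are hypothetical.

class Peer:
    """One process that maps the shared guest RAM."""

    def __init__(self, name):
        self.name = name
        self.registered = False
        self.populated = set()
        self.wp = set()

    def prepare(self, npages):
        self.registered = True                # 1a: register uffd handler
        self.populated = set(range(npages))   # 1b: fault in all pages
        self.wp = set(range(npages))          # 1c: write-protect the range

    def write_hits_wp(self, page):
        # 2a: a write to a protected page raises an event in this process.
        return page in self.wp

    def unprotect(self, page):
        # 2c: clear the wp bit in this process's page table.
        self.wp.discard(page)


def run_snapshot(peers, guest_ram, writes):
    """writes is a list of (peer_index, page, new_value) tuples."""
    stream = []      # stands in for the migration stream
    requests = 0     # QEMU <-> process round trips
    for peer in peers:
        peer.prepare(len(guest_ram))
        requests += 1
    for peer_index, page, value in writes:
        if peers[peer_index].write_hits_wp(page):
            stream.append((page, guest_ram[page]))   # 2b: save old contents
            for peer in peers:                       # 2c: fan out to everyone
                peer.unprotect(page)
                requests += 1
        guest_ram[page] = value                      # the write then proceeds
    return stream, requests


peers = [Peer("qemu"), Peer("vhost-user-1"), Peer("vhost-user-2")]
guest_ram = [10, 11, 12, 13]
writes = [(0, 2, 99), (1, 2, 77), (2, 0, 55)]
stream, requests = run_snapshot(peers, guest_ram, writes)
```

With three processes, two first-time page writes already cost nine requests in this model (three for preparation, three per protected page touched), and the count scales with the number of processes for every page.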