For uffd-wp in its current form, I think it would certainly be the way to go.
AFAICT, the idea of special swap entries isn't new, just that it's limited to
anonymous memory for now, which makes things like fork and new mappings a lot
cheaper.
Thanks for reviewing this series separately; yes I definitely wanted to get
comments on both sides: one on the pte marker idea itself, the other on whether
it's applicable to this swap+shmem use case.
Here I really want the pte marker to be flexible - it can be strict when
necessary (it will be 100% strict with uffd-wp), but it can also be a hint,
just like the available pte bits we have for soft-dirty, idle and accessed.
I want the swap bit to be of that kind, so we add zero overhead to fork() and
we still solve problems.
Same thing for "newly mapped shmem". Do we have a use case for that? If it's
a hint bit, can we ignore it?
As already expressed, we should try storing as little information in page
tables as possible if we're dealing with shared memory. The features we
design around this all seem to over-complicate the actual users,
over-complicate fork, over-complicate handling on new mappings.
I'll skip the last two "over-complicated" items, because as we've discussed I
don't think we need to take care of them so far. We can revisit when they
become some kind of requirement.
To me having PM_SWAP 99% right on shmem is still progress compared to missing
it completely, even if it's not 100% right. It's used for performance reasons
on PAGEOUT and for finer-grained memory control from userspace; it's not a
strict requirement.
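
For reference, a minimal sketch of how a userspace consumer might treat PM_SWAP
as a hint when decoding a 64-bit entry read from /proc/<pid>/pagemap. The bit
positions follow Documentation/admin-guide/mm/pagemap.rst; the enum and
function names are made up for illustration and are not part of any kernel or
library API.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout as documented in Documentation/admin-guide/mm/pagemap.rst */
#define PM_PFN_MASK   ((1ULL << 55) - 1)  /* bits 0-54: PFN if present */
#define PM_SOFT_DIRTY (1ULL << 55)
#define PM_FILE       (1ULL << 61)        /* file-mapped or shared-anon */
#define PM_SWAP       (1ULL << 62)
#define PM_PRESENT    (1ULL << 63)

/* Hypothetical names, for illustration only */
enum page_hint { HINT_PRESENT, HINT_SWAPPED, HINT_UNKNOWN };

/* Classify one pagemap entry; PM_SWAP is only consulted as a hint. */
enum page_hint classify_pagemap_entry(uint64_t entry)
{
	if (entry & PM_PRESENT)
		return HINT_PRESENT;
	if (entry & PM_SWAP)
		return HINT_SWAPPED;
	/* Neither bit set: the page may still be swapped out (none pte). */
	return HINT_UNKNOWN;
}

/* Extract the PFN field; only meaningful when the page is present. */
uint64_t pagemap_pfn(uint64_t entry)
{
	assert(entry & PM_PRESENT);
	return entry & PM_PFN_MASK;
}
```

A caller would read 8 bytes at offset (vaddr / page_size) * 8 of the pagemap
file and pass them in. An all-zero entry is how a none pte shows up, so a
consumer that treats HINT_UNKNOWN as "maybe swapped" tolerates a missing
PM_SWAP bit.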
So I still cannot quite follow why storing information in the pte is so bad
for file-backed memory, which I can see you really don't like. Could you share
a detailed example?
But I guess I'm biased at this point because the main users of these
features actually want to query/set such properties for all sharers, not
individual processes; so the opinion of others would be appreciated.
Known Issues/Concerns
=====================
About THP
---------
Currently we don't need to worry about THP because paged out shmem pages will
be split when shrinking, IOW we only need to consider PTEs, and the markers
will only be applied to a shmem pte, not a pmd or anything bigger.
About PM_SWAP Accuracy
----------------------
This is not an "accurate" solution to provide the PM_SWAP bit. Two examples:

  - When process A & B both map shmem page P somewhere, it can happen that only
    one of these ptes gets marked with the pte marker. Imagine the sequence
    below:

    0. Process A & B both map shmem page P somewhere
    1. Process A zaps the pte of page P for some reason (e.g. thp split)
    2. System decides to recycle page P
    3. System replaces process B's pte (pointing to P) with a PTE marker
    4. System _doesn't_ replace process A's pte because it was a none pte, and
       it'll continue to be a none pte
    5. Only process B's relevant pte has the PTE marker after P is swapped out

  - On fork, we don't copy shmem vma ptes, including the pte markers. So
    even if page P was swapped out, only the parent process has the pte marker
    installed; in the child it'll be a none pte if fork() happened after
    pageout.
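
The two cases above can be sketched as a toy userspace model. This is only an
illustration under simplifying assumptions: one pte per process for page P,
and all names (pte_state, zap, pageout, fork_child_pte, reports_pm_swap) are
hypothetical, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-process view of the pte that would map shmem page P */
enum pte_state { PTE_NONE, PTE_PRESENT, PTE_MARKER };

/* Step 1: a process drops its mapping of P (e.g. thp split) */
void zap(enum pte_state *pte)
{
	*pte = PTE_NONE;
}

/* Steps 2-4: pageout only converts ptes that currently map P */
void pageout(enum pte_state *ptes, int n)
{
	assert(n >= 0);
	for (int i = 0; i < n; i++)
		if (ptes[i] == PTE_PRESENT)
			ptes[i] = PTE_MARKER;
}

/* fork() does not copy shmem ptes or markers: the child starts at none */
enum pte_state fork_child_pte(enum pte_state parent)
{
	(void)parent;
	return PTE_NONE;
}

/* In this model PM_SWAP is reported only where a marker is installed */
bool reports_pm_swap(enum pte_state pte)
{
	return pte == PTE_MARKER;
}
```

With two ptes starting as PTE_PRESENT, calling zap() on A's and then pageout()
on both leaves only B reporting PM_SWAP, and a child forked from B afterwards
reports nothing either.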
Conclusion: just like it used to be, PM_SWAP is best-effort. But it should
work in 99.99% of cases and it should already start to solve problems.
At least I don't like these semantics at all. PM_SWAP is a cached value
which might be under-represented and consequently wrong.
Please have a look at the current pagemap impl in pte_to_pagemap_entry(). It
has not been accurate since day one, imho. E.g., when a page is being migrated
from numa node 1 to node 2, we'll mark it PM_SWAP even though it isn't swapped
out. We can make it more accurate, but I think it's fine, because it's a hint.
Take CRIU as an example, it has to be correct even if a process would remap a
memory region, fork() and unmap in the parent as far as I understand, ...
Are you talking about the dirty bit or the swap bit? I'm a bit confused about
why the swap bit needs to be accurate. Maybe you mean the dirty bit?