Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER

From: David Hildenbrand
Date: Tue Aug 17 2021 - 14:47:02 EST



For uffd-wp in its current form, it would certainly be the way to go I
think. AFAIKT, the idea of special swap entries isn't new, just that it's
limited to anonymous memory for now, which makes things like fork and new
mappings a lot cheaper.

Thanks for reviewing this series separately; yes I definitely wanted to get
comments on both sides: one on the pte marker idea, the other is whether it's
applicable to this swap+shmem use case.

Exactly.


Here I really want to make the pte marker be flexible - it can be strict when
necessary (it will be 100% strict with uffd-wp), then it can also be a hint
just like what we have with available ptes on soft-dirty, idle, accessed bits.
Here the swap bit I wanted to make it that kind, so we add zero overhead to
fork() and we still solve problems.

Same thing to "newly mapped shmem". Do we have a use case for that? If that's
a hint bit, can we ignore it?

I am really not a fan of taking an already broken feature (broken presence information for shmem in pagemap) and instead of fixing it properly, turning it less broken, crossing fingers that nobody will notice the remaining (undocumented) cases.

[...]

As already expressed, we should try storing as little information in page
tables as possible if we're dealing with shared memory. The features we
design around this all seem to over-complicate the actual users,
over-complicate fork, over-complicate handling on new mappings.

I'll skip the last two "over-complicated" items, because as we've discussed I
don't think we need to take care of them so far. We can revisit when they
become some kind of requirement.

To me having PM_SWAP 99% right on shmem is still a progress comparing to
completely missing it, even if it's not 100% right. It's used for performance
reasons on PAGEOUT and doing finer-grained memory control from userspace, it's
not a strict requirement.

So I still cannot strictly follow why storing information in pte is so bad for
file-backed, which I can see you really don't like it. Could you share some
detailed example?

I am not a fan of your approach of "hints", because while it might work for some use cases, it might not work for others (see below for CRIU); I would rather like to avoid such "inconsistent caches" where it's not really required. But again, this is just my opinion and there might be plenty other people that will most probably disagree.

Storing hints in page tables isn't actually that bad, because we can drop hints whenever we like (well, there are side effects, and once we drop hints too often people might complain again) -- for example, when reclaiming "empty" but actually "filled with hints" page tables. When we rely on consistent values, fork() and mmap() are a problem. Well, and page tables will have to stick around. At least for uffd-wp, mmap() is not an issue, and we don't expect too many uffd-wp users such that page table consumption would matter -- my guess.

So I repeat, for uffd-wp in its current form, it sounds like the right thing to do.

But I guess I'm biased at this point because the main users of these
features actually want to query/set such properties for all sharers, not
individual processes; so the opinion of others would be appreciated.


Known Issues/Concerns
=====================

About THP
---------

Currently we don't need to worry about THP because paged out shmem pages will
be split when shrinking, IOW we only need to consider PTE, and the markers will
only be applied to a shmem pte not pmd or bigger.

About PM_SWAP Accuracy
----------------------

This is not an "accurate" solution to provide PM_SWAP bit. Two exmaples:

- When process A & B both map shmem page P somewhere, it can happen that only
one of these ptes got marked with the pte marker. Imagine below sequence:

0. Process A & B both map shmem page P somewhere
1. Process A zap pte of page P for some reason (e.g. thp split)
2. System decides to recycle page P
3. System replace process B's pte (pointed to P) by PTE marker
4. System _didn't_ replace process A's pte because it was none pte, and
it'll continue to be none pte
5. Only process B's relevant pte has the PTE marker after P swapped out

- When fork, we don't copy shmem vma ptes, including the pte markers. So
even if page P was swapped out, only the parent process has the pte marker
installed, in child it'll be none pte if fork() happened after pageout.

Conclusion: just like it used to be, the PM_SWAP is best-effort. But it should
work in 99.99% cases and it should already start to solve problems.

At least I don't like these semantics at all. PM_SWAP is a cached value
which might be under-represented and consequently wrong.

Please have a look at current pagemap impl in pte_to_pagemap_entry(). It's not
accurate from the 1st day, imho. E.g., when a page is being migrated from numa
node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case. We can
make it more accurate, but I think it's fine, because it's a hint.

That inconsistency doesn't really matter as you can determine if something is present and worth dumping if it's either swapped or present. As long as it's one of both but not simply nothing.

I will shamelessly reference tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that checks exactly for that (the test case uses only private anonymous memory).


Take CRIU as an example, it has to be correct even if a process would remap a
memory region, fork() and unmap in the parent as far as I understand, ...

Are you talking about dirty bit or swap bit? I'm a bit confused on why swap
bit needs to be accurate. Maybe you mean the dirty bit?

https://criu.org/Shared_memory

"Dumping present pages"

"... CRIU does not dump all of the data. Instead, it determines which pages contain it, and only dumps those pages. This is done similarly to how regular memory dumping and restoring works, i.e. by looking for PRESENT or SWAPPED bits in owners' pagemap entries."

-> Neither PRESENT nor SWAPPED results in memory not getting dumped, which makes perfect sense.

1) Process A sets up shared memory and writes data to it.
2) System swaps out memory, hints are setup.
3) Process A forks Process B, hints are not copied.
4) Process A unmaps shared memory, hints are dropped.
5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in pagemap.
6) Process B uses memory in shared memory region. Pages were not migrated.

Just one example; feel free to correct me.


There is notion of the mincore() systemcall:

"There is one particular feature of shared memory dumps worth mentioning. Sometimes, a shared memory page can exist in the kernel, but it is not mapped to any process. CRIU detects such pages by calling mincore() on the shmem segment, which reports back the page in-memory status. The mincore bitmap is when ANDed with the per-process ones. "

Not sure if they actually mean ORed, because otherwise they'd be losing pages that have been swapped out. "mincore() returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM)"

--
Thanks,

David / dhildenb