You always have very good summary of the situation. Thanks a lot forSounds good?
Adding to the discussion, currently the COW selftest always skips a PTE-mapped THP.
adding the following information.
Add Zi Yan as this is still about mapcount of the folio.
So you want to accurate number. My understanding is that it's required for COW case.
For example:
# [INFO] Anonymous memory tests in private mappings
# [RUN] Basic COW after fork() ... with base page
ok 1 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped out base page
ok 2 No leak from parent into child
# [RUN] Basic COW after fork() ... with THP
ok 3 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out THP
ok 4 No leak from parent into child
# [RUN] Basic COW after fork() ... with PTE-mapped THP
ok 5 No leak from parent into child
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
...
The commit that introduced that change is:
commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
Author: Vishal Moola (Oracle) <vishal.moola@xxxxxxxxx>
Date: Wed Dec 21 10:08:46 2022 -0800
madvise: convert madvise_cold_or_pageout_pte_range() to use folios
This change removes a number of calls to compound_head(), and saves
1729 bytes of kernel text.
folio_mapcount(folio) is wrong, because that never works on a PTE-mapped THP (well, unless only a single subpage is still mapped ...).
page_mapcount(folio) was wrong, because it ignored all other subpages, but at least it worked in some cases.
folio_estimated_sharers(folio) is similarly wrong like page_mapcount(), as it's essentially a page_mapcount() of the first subpage.
(ignoring that a lockless mapcount-based check is always kind-of unreliable, but that's msotly acceptable for these kind of things)
So, unfortunately, page_mapcount() / folio_estimated_sharers() is best we can do for now, but they miss to detect some cases of sharing of the folio -- false negatives to detect sharing.
Ideally we want something like folio_maybe_mapped_shared(), and get rid of folio_estimated_sharers(), we better to guess the exact number, simply works towards an answer that tells us "yep, this may be mapped by multiple sharers" vs. "no, this is definitely not mapped by multiple sharers".
I was wondering whether we could identify the cases as:
While it's better than what we have right now:
(a) It's racy. Well, it's always been racy with concurrent (un)mapping
and splitting. But maybe we can do better.
(b) folio_total_mapcount() is currently expensive.
(c) there are still false negatives even without races.
For anon pages, we could scan all subpages and test if they are PageAnonExclusive, but it's also not really what we want to do here.
- bold estimated mapcount is enough. In this case, we can use
current folio_estimated_sharers() for now. For long term, I
am with Zi Yan's proposal: maintain total_mapcount and just use
total_mapcount > folio_nr_pages() as estimated number.
The madvise/migration cases are identified as this type.
- Need some level accurate. Use estimated mapcount to filter out obvious
shared case first as estimated mapcount is correct for shared case.
Then use some heavy operations (check anon folio if pages are
PageAnonExclusive or use pvmw) to get more accurate number.
Cow is identified as this type.
Great.
I have some idea to handle anon pages better to avoid any page table walk or subpage scan, improving (a), (b) and (c). It might work for pagecache pages with some more work, but it's a bit more complicated with the scheme I have in mind).
I saw Zi Yan shared same proposal.
First step would be replacing folio->_nr_pages_mapped by folio->_total_mapcount. While we could eventually have folio->_total_mapcount in addition to folio->_nr_pages_mapped, I'm, not sure if we want to go down that path
It's nasty because partial mapped to parent and partial mapped to child? Thanks.
That would make folio_total_mapcount() extremely fast (I'm working on a prototype). The downsides are that
(a) We have to make NR_ANON_MAPPED/NR_FILE_MAPPED accounting less precise. Easiest way to handle it: as soon as a single subpage is mapped, account the whole folio as mapped. After all, it's consuming memory, so who cares?
(b) We have to find a different way to decide when to put an anonymous folio on the deferred split queue in page_remove_rmap(). Some cases are nasty to handle: PTE-mapped THP that are shared between a parent and a child.