Hi David,
Thanks for such a quick reply!
On 2024/6/13 17:04, David Hildenbrand wrote:
On 13.06.24 10:38, Qi Zheng wrote:
Hi all,
[...]
3. Implementation
=================
For empty user PTE pages, we don't actually need to free it
immediately, nor do
we need to free all of it.
Therefore, in this patchset, we register a task_work for the user
tasks to
asyncronously scan and free empty PTE pages when they return to user
space.
(The scanning time interval and address space size can be adjusted.)
The question is, if we really have to scan asynchronously, or if would
be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
every now and then. For virtio-mem, and likely most memory allocators,
that might be feasible, and valuable independent of system-wide
automatic scanning.
Agree, I also think it is possible to add always && madvise modes
simliar to THP.
When scanning, we can filter out some unsuitable vmas:
- VM_HUGETLB vma
- VM_UFFD_WP vma
Why is UFFD_WP unsuitable? It should be suitable as long as you make
sure to really only remove page tables that are all pte_none().
Got it, I mistakenly thought pte_none() covered pte marker case until
I saw pte_none_mostly().
- etc
And for some PTE pages that spans multiple vmas, we can also skip.
For locking:
- use the mmap read lock to traverse the vma tree and pgtable
- use pmd lock for clearing pmd entry
- use pte lock for checking empty PTE page, and release it after
clearing
pmd entry, then we can capture the changed pmd in
pte_offset_map_lock()
etc after holding this pte lock. Thanks to this, we don't need
to hold the
rmap-related locks.
- users of pte_offset_map_lock() etc all expect the PTE page to
be stable by
using rcu lock, so use pte_free_defer() to free PTE pages.
I once had a protoype that would scan similar to GUP-fast, using the
mmap lock in read mode and disabling local IRQs and then walking the
page table locklessly (no PTLs). Only when identifying an empty page and
ripping out the page table, it would have to do more heavy locking (back
when we required the mmap lock in write mode and other things).
Maybe mmap write lock is not necessary, we can protect it using pmd lock
&& pte lock as above.
I can try digging up that patch if you're interested.
Yes, that would be better, maybe it can provide more inspiration!
We'll have to double check whether all anon memory cases can *properly*
handle pte_offset_map_lock() failing (not just handling it, but doing
the right thing; most of that anon-only code didn't ever run into that
issue so far, so these code paths were likely never triggered).
Yeah, I'll keep checking this out too.
For the path that will also free PTE pages in THP, we need to recheck
whether the
content of pmd entry is valid after holding pmd lock or pte lock.
4. TODO
=======
Some applications may be concerned about the overhead of scanning and
rebuilding
page tables, so the following features are considered for
implementation in the
future:
- add per-process switch (via prctl)
- add a madvise option (like THP)
- add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via
procfs file)
Perhaps we can add the refcount to PTE pages in the future as well,
which would
help improve the scanning speed.
I didn't like the added complexity last time, and the problem of
handling situations where we squeeze multiple page tables into a single
"struct page".
OK, except for refcount, do you think the other three todos above are
still worth doing?