On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> In this patch series, we add a pte_refcount field to the struct page of a
> page table page to track how many users the PTE page table has. Similar to
> the page refcount mechanism, a user of a PTE page table should hold a
> refcount to it before accessing it. The PTE page table page is freed when
> the last refcount is dropped.
So, this approach basically adds two atomics on every PTE map.
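
That is, every walker pays a get/put pair on the PTE table page around each
access, roughly like the sketch below (helper names invented here, not the
series' actual API; only the pte_refcount field itself comes from the
series' description):

#include <linux/atomic.h>
#include <linux/gfp.h>
#include <linux/mm_types.h>

/*
 * Hypothetical helpers illustrating the described pattern.  Each PTE
 * map takes a reference on the PTE table page and each unmap drops
 * it -- those are the two atomics.
 */
static inline bool pte_table_tryget(struct page *pte_page)
{
        /* fails once the table is already on its way to being freed */
        return atomic_inc_not_zero(&pte_page->pte_refcount);
}

static inline void pte_table_put(struct page *pte_page)
{
        /* last user frees the now-unused PTE table page */
        if (atomic_dec_and_test(&pte_page->pte_refcount))
                __free_page(pte_page);
}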
If I have it right, the reason that zap cannot clean the PTEs today is
that zap cannot obtain the mmap lock, due to a lock ordering issue between
the inode lock and the mmap lock.
If it could obtain the mmap lock, then it could do the zap using the
write side, as unmapping a vma does.
Rather than adding a new "lock" to every PTE, I wonder if it would be
more efficient to break up the mmap lock and introduce a specific
rwsem for the page table itself, in addition to the PTL. Currently the
mmap lock protects both the vma list and the page table.
I think that would allow the lock ordering issue to be resolved, and
zap could then obtain the page table rwsem.
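
Very roughly, with an invented name for the new lock (assume mm_struct
gains a 'struct rw_semaphore pgtable_rwsem' member, initialized with
init_rwsem() alongside mmap_lock):

#include <linux/rwsem.h>
#include <linux/mm_types.h>

/*
 * Sketch only: the zap/free path takes the hypothetical page table
 * rwsem exclusively, the way unmapping a vma takes the mmap write
 * lock today, and then clears and frees the empty PTE tables.
 */
static void zap_free_pte_tables(struct mm_struct *mm,
                                unsigned long start, unsigned long end)
{
        down_write(&mm->pgtable_rwsem);
        /* ... walk the pmds in [start, end) and free empty PTE tables ... */
        up_write(&mm->pgtable_rwsem);
}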
Compared to two atomics per PTE, this would be just two atomics per page
table walk operation; it is conceptually a lot simpler, and it would
allow freeing all the page table levels, not just PTEs.
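
That is, on the walker side (same invented lock name as above):

/*
 * Reader/walker side of the same hypothetical lock: the two atomics
 * are paid once per walk operation, not once per PTE map.
 */
static void walk_user_ptes(struct mm_struct *mm,
                           unsigned long start, unsigned long end)
{
        down_read(&mm->pgtable_rwsem);          /* atomic #1 */
        /* ... walk and map PTE tables freely, taking the PTL per entry ... */
        up_read(&mm->pgtable_rwsem);            /* atomic #2 */
}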
?
Jason