On 30.04.21 21:52, Michel Lespinasse wrote:
This patchset is my take on speculative page faults (spf).
It builds on ideas previously proposed by Laurent Dufour,
Peter Zijlstra and others. While Laurent's previous proposal
was rejected around the time of LSF/MM 2019, I am hoping we can revisit
this now based on what I think is a simpler and more bisectable approach,
much improved scaling numbers in the anonymous vma case, and the Android
use case that has since emerged. I will expand on these points towards
the end of this message.
The patch series applies on top of linux v5.12;
a git tree is also available:
git fetch https://github.com/lespinasse/linux.git v5.12-spf-anon
I believe these patches should be considered for merging.
My github also has a v5.12-spf branch which extends this mechanism
for handling file mapped vmas too; however I believe these are less
mature and I am not submitting them for inclusion at this point.
Compared to the previous (RFC) proposal, I have split out / left out
the file VMA handling parts, fixed some config specific build issues,
added a few more comments and modified the speculative fault handling
to use rcu_read_lock() rather than local_irq_disable() in the
Classical page fault processing takes the mmap read lock in order to
prevent races with mmap writers. In contrast, speculative fault
processing does not take the mmap read lock, and instead verifies,
when the results of the page fault are about to get committed and
become visible to other threads, that no mmap writers have been
running concurrently with the page fault. If the check fails,
speculative updates do not get committed and the fault is retried
in the usual, non-speculative way (with the mmap read lock held).
The concurrency check is implemented using a per-mm mmap sequence count.
The counter is incremented at the beginning and end of each mmap write
operation. If the counter is initially observed to have an even value,
and has the same value later on, the observer can deduce that no mmap
writers have been running concurrently with it between those two times.
This is similar to a seqlock, except that readers never spin on the
counter value (they would instead revert to taking the mmap read lock),
and writers are allowed to sleep. One benefit of this approach is that
it requires no writer side changes, just some hooks in the mmap write
lock APIs that writers already use.
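
The counting scheme described above can be sketched in userspace C11. This is only an illustration under stated assumptions, not code from the patchset: struct mm_sketch and all sketch_* helpers are hypothetical names, and the real hooks would sit inside the mmap write lock APIs, which also serialize writers against each other (the sketch omits that lock).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm state; not the kernel's mm_struct. */
struct mm_sketch {
    atomic_ulong mmap_seq;  /* even: no writer active, odd: writer active */
};

/* Hook at the start of an mmap write section: count becomes odd. */
static void sketch_write_begin(struct mm_sketch *mm)
{
    unsigned long old = atomic_fetch_add(&mm->mmap_seq, 1);
    assert((old & 1) == 0);  /* writers are assumed to be serialized */
}

/* Hook at the end of an mmap write section: count becomes even again. */
static void sketch_write_end(struct mm_sketch *mm)
{
    unsigned long old = atomic_fetch_add(&mm->mmap_seq, 1);
    assert((old & 1) == 1);
}

/*
 * Reader side: sample the count once. An odd value means a writer is
 * active, so the caller should not speculate. It never spins here; it
 * would fall back to taking the mmap read lock instead.
 */
static bool sketch_spec_begin(struct mm_sketch *mm, unsigned long *seq)
{
    *seq = atomic_load(&mm->mmap_seq);
    return (*seq & 1) == 0;
}

/* Reader side: true if no write section started since sketch_spec_begin(). */
static bool sketch_spec_validate(struct mm_sketch *mm, unsigned long seq)
{
    return atomic_load(&mm->mmap_seq) == seq;
}
```

A reader calls sketch_spec_begin(), does its speculative work, and commits only if sketch_spec_validate() still returns true. Any failure means retrying in the usual way with the mmap read lock held.
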
The first step of a speculative page fault is to look up the vma and
read its contents (currently by making a copy of the vma, though in
principle it would be sufficient to only read the vma attributes that
are used in page faults). The mmap sequence count is used to verify
that there were no mmap writers concurrent to the lookup and copy steps.
Note that walking rbtrees while there may potentially be concurrent
writers is not an entirely new idea in linux, as latched rbtrees
are already doing this. This is safe as long as the lookup is
followed by a sequence check to verify that concurrency did not
actually occur (and abort the speculative fault if it did).
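
As a rough illustration of the "look up, copy, then check" ordering, here is a minimal userspace sketch. It makes several assumptions that are not from the patchset: a single vma stands in for the rbtree, and struct vma_sketch, struct mm_lookup_sketch, sketch_copy_vma() and sketch_fault() are hypothetical names. Real kernel code would also need READ_ONCE-style accesses and memory barriers around the copy; the sketch is only meant to be exercised single-threaded.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, reduced vma: only the fields this sketch looks at. */
struct vma_sketch {
    unsigned long start, end;
    int flags;
};

struct mm_lookup_sketch {
    atomic_ulong seq;        /* mmap sequence count: odd while a writer runs */
    struct vma_sketch vma;   /* one vma standing in for the rbtree */
};

enum fault_path { FAULT_SPECULATIVE, FAULT_LOCKED };

/*
 * Copy the vma without taking any lock, then check the sequence count.
 * The copy may be inconsistent if a writer ran concurrently, which is
 * why it is only trusted once the check has passed.
 */
static bool sketch_copy_vma(struct mm_lookup_sketch *mm, unsigned long addr,
                            struct vma_sketch *out)
{
    unsigned long seq = atomic_load(&mm->seq);

    if (seq & 1)
        return false;               /* writer active: do not speculate */

    *out = mm->vma;                 /* lookup + copy, unlocked */

    if (atomic_load(&mm->seq) != seq)
        return false;               /* writer ran meanwhile: abort */

    return addr >= out->start && addr < out->end;
}

/* Try the speculative path first; otherwise retry under the read lock. */
static enum fault_path sketch_fault(struct mm_lookup_sketch *mm,
                                    unsigned long addr)
{
    struct vma_sketch copy;

    if (sketch_copy_vma(mm, addr, &copy))
        return FAULT_SPECULATIVE;   /* continue using the private copy */
    return FAULT_LOCKED;            /* classical path, mmap read lock held */
}
```

Working on a private copy means the rest of the speculative fault never dereferences the shared vma again, so a later writer cannot change the attributes underneath it. The final commit still has to re-check the sequence count.
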
The next step is to walk down the existing page table tree to find the
current pte entry. This is done with interrupts disabled to avoid
races with munmap(). Again, not an entirely new idea, as this repeats
a pattern already present in fast GUP. Similar precautions are also
taken when taking the page table lock.
I just started working on a project to reclaim page tables that are no
longer needed inside running processes (for example, empty after
madvise(DISCARD)). Long story short, there are scenarios where we want
to scan for such page tables asynchronously to free up memory (which can
be quite significant in some use cases).
Now that I (mostly) understand the complex locking, I'm looking for
other mm features that might be "problematic" in that regard and require
proper planning to get right (or that have to run mutually exclusively).
As I essentially rip out page tables from the page table hierarchy to
free them (in the simplest case within a VMA to get started), I
certainly need the mmap lock in read mode right now to scan the page table
hierarchy, and the mmap lock in write mode when actually removing a page
table. This is similar to what khugepaged does when collapsing a THP and
removing a page table. Of course, we could use any kind of
synchronization mechanism (-> rcu) to make sure nobody is using a page
table anymore before actually freeing it.
1. I now wonder how your code actually protects against e.g., khugepaged
and how it could protect against page table reclaim. Will we be using
RCU while walking the page tables? That would make life easier.
2. You mention "interrupts disabled to avoid races with munmap()". Can
you elaborate on how that is supposed to work? Shouldn't we rather be using
RCU than manually disabling interrupts? What is the rationale?