Maybe, we'd have to do the isolation+copy under the PMD lock. And
currently, we have to drop the PMD lock in order to have the
pte_offset_map_lock() work IIRC.
Is there a problem in holding two page table locks simultaneously?
Depending on CONFIG_SPLIT_PTE_PTLOCKS, it might not even be two locks (I assume we could have such configs with khugepaged).
Not sure if there could be an issue with lock inversion.
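To illustrate the "might not even be two locks" point, here is a toy userspace model, not kernel code. It assumes the lock lookup works roughly like pte_lockptr()/pmd_lockptr(): with split ptlocks each page table carries its own lock, otherwise everything falls back to the one per-mm lock. All toy_* names are made up for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins: an int plays the role of a spinlock_t. */
struct toy_mm    { int page_table_lock; };  /* the single per-mm lock */
struct toy_table { int ptl; };              /* a per-table split lock */

/* Assumed shape of pte_lockptr(): per-table lock only when split. */
int *toy_pte_lockptr(struct toy_mm *mm, struct toy_table *pte_table,
		     bool split_pte)
{
	return split_pte ? &pte_table->ptl : &mm->page_table_lock;
}

/* Assumed shape of pmd_lockptr(): same idea one level up. */
int *toy_pmd_lockptr(struct toy_mm *mm, struct toy_table *pmd_table,
		     bool split_pmd)
{
	return split_pmd ? &pmd_table->ptl : &mm->page_table_lock;
}

/*
 * True if the "PTE lock" and the "PMD lock" are the very same lock
 * for the given config, in which case taking both would self-deadlock
 * on a non-recursive spinlock.
 */
bool toy_locks_alias(bool split_pte, bool split_pmd)
{
	struct toy_mm mm = { 0 };
	struct toy_table pte_table = { 0 }, pmd_table = { 0 };

	/* Sanity: the two tables themselves are distinct objects. */
	assert(&pte_table != &pmd_table);

	return toy_pte_lockptr(&mm, &pte_table, split_pte) ==
	       toy_pmd_lockptr(&mm, &pmd_table, split_pmd);
}
```

With both split options off, toy_locks_alias(false, false) is true, so any "nest PTL under PMD lock" scheme would have to special-case that config. The combination "split PMD but not split PTE" is, as far as I know, not a valid config, so the sketch says nothing useful about it.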
So I suspect this to not be 100% trivial :)
Most importantly, the copy that currently runs under no spinlocks
would now run under spinlock. Up to 512 MiB on arm64 64K, not sure if
that can be a problem ... we currently seem to take care of that
But we already are taking mmap_write_lock(), so that should not matter?
We are dealing with a spinlock vs. a rwsem.
We usually want to avoid holding spinlocks for an excessive amount of time, because all other CPUs waiting for that lock will ... spin with preemption disabled instead of rescheduling and doing something useful.
Further, without CONFIG_SPLIT_PMD_PTLOCKS, in fact everybody who wants to take a PMD lock in that process would be spinning on the same PMD lock :)
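For scale, the 512 MiB figure mentioned earlier can be reproduced with a bit of arithmetic. A minimal userspace sketch, assuming 8-byte page table entries and a PTE table that fills exactly one base page; pmd_size_bytes() is a made-up helper, not a kernel function:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes covered by one PMD entry, i.e. the amount that would have to
 * be copied while holding the spinlock.
 * Assumptions: 8-byte entries, PTE table == one base page.
 */
uint64_t pmd_size_bytes(uint64_t page_size)
{
	uint64_t ptes_per_table;

	/* Page sizes are powers of two. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	ptes_per_table = page_size / 8;
	return ptes_per_table * page_size;
}
```

Under those assumptions, 4 KiB base pages give 2 MiB per PMD and 64 KiB base pages give 512 MiB, which is a very different amount of copying to do with other CPUs spinning.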
I mean, if we can get rid of the mmap exclusive lock, then the copying
would still be a bottleneck, and all fault handlers will back off, but
I'm trying to digest it once again, but I'm afraid I don't see how fault handlers will back off.
Won't they either see pmd_none(), to then either call pte_alloc_one() where they would spin on the PMD lock, or try allocating a PMD THP to insert it, and then spin on the PMD lock, to figure out later that it was all in vain?
Yes, I thought of different locking fashions but the problem seems to be that any pagetable walker will choose an action path according to the value
Thinking about it, I am also not sure if most other code can deal with temporary pmd_none(). These don't necessarily take the PMD lock, because "why would they" right now.
See walk_pmd_range() as one example, if it spots pmd_none() it assumes "there really is nothing" without checking the PMD lock.
As a more concrete example, assume someone calls MADV_DONTNEED and we end up in zap_pmd_range(). If we assume "pmd_none() == really nothing" we'll skip that entry without getting the PMD lock involved. That would mean that you would be breaking MADV_DONTNEED if you managed to collapse or not -- memory would not get discarded.
This is a real problem with anonymous memory.
Unless I am missing something it's all very tricky and there might be a lot of such code that assumes "if I hold a mmap lock / VMA lock in read mode, pmd_none() means there is nothing even without holding the PMD lock when checking".
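A toy userspace model of the MADV_DONTNEED concern, with the race replayed as one fixed single-threaded interleaving. It is not kernel code: the toy_* names are invented, and the "walker" only mimics the pattern of skipping an entry that looks like pmd_none() without taking the PMD lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy PMD entry: "present" false plays the role of pmd_none(). */
struct toy_pmd {
	bool present;
	int pages;	/* pages currently mapped below this entry */
};

/*
 * Walker in the style described above: an empty-looking entry is
 * skipped without any locking. Returns the number of pages discarded.
 */
int toy_zap_range(struct toy_pmd *pmds, size_t n)
{
	int discarded = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pmds[i].present)	/* lockless "nothing here" check */
			continue;
		discarded += pmds[i].pages;
		pmds[i].pages = 0;
		pmds[i].present = false;
	}
	return discarded;
}

/* Collapse, step 1: temporarily clear the entry, remember the old one. */
struct toy_pmd toy_collapse_begin(struct toy_pmd *pmd)
{
	struct toy_pmd saved = *pmd;

	assert(saved.present);
	*pmd = (struct toy_pmd){ .present = false, .pages = 0 };
	return saved;
}

/* Collapse, step 2: reinstall an entry mapping the same memory. */
void toy_collapse_finish(struct toy_pmd *pmd, struct toy_pmd saved)
{
	*pmd = saved;
}

/* Zap runs inside the collapse window; returns pages still mapped. */
int toy_pages_left_after_racy_zap(void)
{
	struct toy_pmd pmd = { .present = true, .pages = 512 };
	struct toy_pmd saved = toy_collapse_begin(&pmd);

	(void)toy_zap_range(&pmd, 1);	/* sees "none", skips the entry */
	toy_collapse_finish(&pmd, saved);
	return pmd.pages;
}
```

In this interleaving the zap reports success but all 512 pages are still mapped afterwards, which is the "memory would not get discarded" outcome. Without the temporary clear, the same zap discards everything.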
at least processes will be able to mmap() and do stuff with their VMAs,
and I would guess that this is worth optimizing...
It would certainly be interesting to get rid of the mmap lock in write mode here, but it's all rather tricky (and the code has rather nasty hidden implications).
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
...
spin_unlock(pte_ptl);
} ...
result = __collapse_huge_page_copy(...);
pte_unmap(pte);
Deep in __collapse_huge_page_copy() we seem to re-take the PTL lock.
No-split-spinlock configs might be problematic ...
Could you elaborate a little? I haven't read about the older config...
See above regarding CONFIG_SPLIT_PTE_PTLOCKS and friends.
I recall that for shmem that's "easier", because we don't have to
reinstall a PMD immediately, we can be sure that the page table is
kept empty/unmodified, ...