Re: [PATCH v4] mm: use per_vma lock for MADV_DONTNEED
From: Kefeng Wang
Date: Tue Nov 04 2025 - 20:04:54 EST
On 2025/11/4 23:21, Lorenzo Stoakes wrote:
On Tue, Nov 04, 2025 at 08:09:58PM +0800, Kefeng Wang wrote:
On 2025/11/4 17:01, Lorenzo Stoakes wrote:
On Tue, Nov 04, 2025 at 04:34:35PM +0800, Kefeng Wang wrote:
+static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavior)
{
+ int behavior = madv_behavior->behavior;
+
if (is_memory_failure(behavior))
- return 0;
+ return MADVISE_NO_LOCK;
- if (madvise_need_mmap_write(behavior)) {
+ switch (behavior) {
+ case MADV_REMOVE:
+ case MADV_WILLNEED:
+ case MADV_COLD:
+ case MADV_PAGEOUT:
+ case MADV_FREE:
+ case MADV_POPULATE_READ:
+ case MADV_POPULATE_WRITE:
+ case MADV_COLLAPSE:
+ case MADV_GUARD_INSTALL:
+ case MADV_GUARD_REMOVE:
+ return MADVISE_MMAP_READ_LOCK;
+ case MADV_DONTNEED:
+ case MADV_DONTNEED_LOCKED:
+ return MADVISE_VMA_READ_LOCK;
I have a question, we will try per-vma lock for dontneed,
but there is a mmap_assert_locked() during madvise_dontneed_free(),
Hmm, this is only in the THP PUD huge case, and MADV_FREE is only valid for
anonymous memory, and I think only DAX can have some weird THP PUD case.
So I don't think we can hit this.
Yes, we don't support pud THP for anonymous pages.
Right, so we can't hit this.
In any event, I think this mmap_assert_locked() is mistaken, as we should
only need a VMA lock here.
So we could replace with a:
if (!rwsem_is_locked(&tlb->mm->mmap_lock))
vma_assert_locked(vma);
?
The pmd dax/anon split don't have assert, for PUD dax, we maybe remove this
assert?
Well, we probably do want to assert that we hold a lock.
OK, let's convert to vma_assert_locked.
madvise_dontneed_free
madvise_dontneed_single_vma
zap_page_range_single_batched
unmap_single_vma
unmap_page_range
zap_pud_range
mmap_assert_locked
We could fix it by passing the lock_mode into zap_detial and then check
the right lock here, but I'm not sure whether it is safe to zap page
only with vma lock?
It's fine to zap with the VMA lock. You need only hold the VMA stable which
a VMA lock achieves.
See https://docs.kernel.org/mm/process_addrs.html
Thanks, I will learn it.
Hopefully useful, I made it to remind myself of these things as they're very
fiddly + otherwise I find myself constantly forgetting these details :)
That should be definitely useful :)
And another about 4f8ba33bbdfc ("mm: madvise: use per_vma lock
for MADV_FREE"), it called walk_page_range_vma() in
madvise_free_single_vma(), but from link[1] and 5631da56c9a8
("fs/proc/task_mmu: read proc/pid/maps under per-vma lock"), it saids
"Note that similar approach would not work for /proc/pid/smaps
reading as it also walks the page table and that's not RCU-safe"
We could use walk_page_range_vma() instead of walk_page_range() in
smap_gather_stats(), and same question, why 4f8ba33bbdfc(for MADV_FREEE)
is safe but not for show_numa_map()/show_smap()?
We only use walk_page_range() there in case 4 listed in show_smaps_rollup()
where the mmap lock is dropped on contention.
Sorry, I mean the walk_page_range() in smap_gather_stats() called by
show_smap() from /proc/pid/smaps, not the walk_page_range() in
show_smaps_rollup() from /proc/pid/smaps_rollup.
show_smaps()
-> smap_gather_stats(..., start = 0)
-> walk_page_vma()
Because:
if (!start)
walk_page_vma(vma, ops, mss);
The only case where start is non-zero is show_smaps_rollup() case 4. So we are
already using walk_page_vma() here right?
I may be missing something here :)
You are right, I don't check start value :(
Thanks.
[1] https://lkml.kernel.org/r/20250719182854.3166724-1-surenb@xxxxxxxxxx
AFAICT That's referring to a previous approach that tried to walk
/proc/$pid/swaps under RCU _alone_ without VMA locks. This is not safe as
page tables can be yanked from under you not under RCU.
But for now it tries per-vma lock or fallback to mmap lock, not lockless, so
do you mean we could try per-vma lock for /proc/pid/numa_maps or
/proc/pid/smaps ?
Probably we could, but I'm not sure if it'd be really worth it, since traversing
page tables is a very heavy operation and so optimising it against contention
like this seems probably not all that worth it?
Suren maybe could comment on this.
They only operate a single vma(walk_page_vma), I think it is always better if we could only hold the vma lock, but wait for Suren comment on it.