Re: [PATCH 2/2] fs/proc/task_mmu: read proc/pid/smaps_rollup under per-vma lock

From: Lorenzo Stoakes

Date: Tue Jun 09 2026 - 06:03:58 EST

On Fri, Jun 05, 2026 at 06:57:29PM -0700, Suren Baghdasaryan wrote:
> proc/pid/smaps_rollup can be read using the combination of RCU and
> VMA read locks, similar to proc/pid/{maps|smaps|numa_maps}. RCU is
> required to safely traverse the VMA tree and VMA lock stabilizes the
> VMA being processed and the pagetable walk.
> Note that we have to keep the logic to drop mmap_lock on contention
> because even when using per-VMA locks we might have to fall back to
> holding the mmap_lock.
>
> Running Paul's contention benchmark [1] shows considerable improvement
> both in median and in the worst case latencies:
>
> Execution command: run-proc-vs-map.sh --nsamples 20 --rawdata -- \
> --busyduration 2 --procfile smaps_rollup
>
> Baseline:
>
> 0.174 0.161 2.553
> 0.174 0.164 2.663
> 0.174 0.165 2.664
> 0.174 0.166 2.679
> 0.174 0.167 2.691
> 0.174 0.168 2.704
> 0.174 0.169 2.729
> 0.174 0.172 2.741
> 0.174 0.174 2.745
> 0.174 0.174 2.755
> 0.174 0.175 2.790
> 0.174 0.177 2.809
> 0.174 0.179 3.096
> 0.174 0.183 3.144
> 0.174 0.184 3.158
> 0.174 0.185 3.175
> 0.174 0.185 4.568
> 0.174 0.198 4.821
> 0.174 0.214 5.143
> 0.174 0.251 5.220
>
> Patched:
>
> 0.007 0.007 1.952
> 0.007 0.007 1.955
> 0.007 0.007 1.955
> 0.007 0.007 1.955
> 0.007 0.007 1.957
> 0.007 0.007 1.969
> 0.007 0.007 2.065
> 0.007 0.007 2.075
> 0.007 0.007 2.146
> 0.007 0.007 2.195
> 0.007 0.007 2.223
> 0.007 0.007 2.259
> 0.007 0.007 2.488
> 0.007 0.007 2.562
> 0.007 0.007 2.599
> 0.007 0.007 2.697
> 0.007 0.007 3.030
> 0.007 0.007 3.075
> 0.007 0.007 3.145
> 0.007 0.007 3.225

Ohhh lovely!

>
> [1] https://github.com/paulmckrcu/proc-mmap_sem-test
>
> Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>

Having looked through it carefully the logic looks correct to me, but I
think we can improve how this is implemented, so attach a patch with my
suggestions folded up to make life easier!

> ---
> fs/proc/task_mmu.c | 134 ++++++++++++++++++++-------------------------
> 1 file changed, 59 insertions(+), 75 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 023422fcee12..c2bd9f5bbbcd 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -233,6 +233,16 @@ static inline void reacquire_rcu(struct proc_maps_private *priv)
> vma_iter_set(&priv->iter, priv->lock_ctx.locked_vma->vm_end);
> }
>
> +static inline bool is_mmap_lock_contended(struct proc_maps_private *priv)

Hmm seems David and I are saying the same things :) but yeah this name being
very close to mmap_lock_is_contended() is suboptimal.

Also the inline is pointless here and suppresses the compiler unused check which
isn't helpful.

But in general, given we already have the lock context I think we can do away
with this altogether, see below.

> +{
> + struct proc_maps_locking_ctx *lock_ctx = &priv->lock_ctx;
> +
> + if (!lock_ctx->mmap_locked)
> + return false;
> +
> + return !!mmap_lock_is_contended(lock_ctx->mm);

As David said, !! is not a necessary incantation any more :)

> +}
> +
> #else /* CONFIG_PER_VMA_LOCK */
>
> static inline int lock_ctx_mm(struct proc_maps_locking_ctx *lock_ctx)
> @@ -268,6 +278,11 @@ static inline bool fallback_to_mmap_lock(struct proc_maps_private *priv,
> return false;
> }
>
> +static inline bool is_mmap_lock_contended(struct proc_maps_private *priv)
> +{
> + return !!mmap_lock_is_contended(priv->lock_ctx.mm);

Similarly. But also as above I don't think we need this.

> +}
> +
> static inline void drop_rcu(struct proc_maps_private *priv) {}
> static inline void reacquire_rcu(struct proc_maps_private *priv) {}
>
> @@ -1486,12 +1501,15 @@ static int show_smap(struct seq_file *m, void *v)
> static int show_smaps_rollup(struct seq_file *m, void *v)
> {
> struct proc_maps_private *priv = m->private;
> + struct proc_maps_locking_ctx *lock_ctx = &priv->lock_ctx;
> + struct mm_struct *mm = lock_ctx->mm;
> struct mem_size_stats mss = {};
> - struct mm_struct *mm = priv->lock_ctx.mm;
> struct vm_area_struct *vma;
> - unsigned long vma_start = 0, last_vma_end = 0;
> + unsigned long vma_start = 0;
> + unsigned long last_vma_end = 0;

I mean I kinda prefer these separate but obviously not necessary :P

> + loff_t pos = 0;
> int ret = 0;
> - VMA_ITERATOR(vmi, mm, 0);
> +

Yeah extra line as David said :)

>
> priv->task = get_proc_task(priv->inode);
> if (!priv->task)
> @@ -1502,89 +1520,55 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
> goto out_put_task;
> }
>
> - ret = lock_ctx_mm(&priv->lock_ctx);
> + hold_task_mempolicy(priv);
> + ret = lock_vma_range(m, lock_ctx);
> if (ret)
> goto out_put_mm;
>
> - hold_task_mempolicy(priv);
> - vma = vma_next(&vmi);
> -
> - if (unlikely(!vma))
> + vma_iter_init(&priv->iter, mm, 0);
> + vma = proc_get_vma(m, &pos);
> + if (unlikely(!vma) || vma == get_gate_vma(priv->lock_ctx.mm))
> goto empty_set;

Is the gate VMA check here necessary? We already check it in the loop.

>
> + if (IS_ERR(vma)) {
> + ret = PTR_ERR(vma);
> + goto out_unlock;
> + }

Again this is duplicated.

> +
> vma_start = vma->vm_start;

Could just drop the gate check, and replace this with:

vma_start = IS_ERR(vma) ? 0 : vma->vm_start;

> - do {
> - smap_gather_stats(priv, vma, &mss, 0);
> + while (vma) {
> + if (IS_ERR(vma)) {
> + ret = PTR_ERR(vma);
> + goto out_unlock;
> + }
> +
> + if (vma == get_gate_vma(priv->lock_ctx.mm))
> + break;
> +
> + /*
> + * If after retaking mmap_lock, already reported VMA grew or
> + * merged with the next one, then iterate from last_vma_end.
> + */

Hmm, but if the already reported VMA grew, vma->vm_start would not be <
last_vma_end would it?

So surely this is only covering the merge case?

> + smap_gather_stats(priv, vma, &mss,
> + vma->vm_start < last_vma_end ? last_vma_end : 0);

I don't love how compressed this is.

I also don't love that 0 is taken to be 'start from vma->vm_start' and I
also don't love that the code in smap_gather_stats() actually special cases
this...

How about passing last_vma_end and making smap_gather_stats() more sane? In
the other invocation of smap_gather_stats() we could pass vma->vm_start
here.

We can then move the comment into smap_gather_stats(). E.g.:

/* Handles the case of VMA merged since mmap locked drop too. */
smap_gather_stats(priv, vma, &mss, last_vma_end);

To make life easier I put all of my suggestions into a patch which I've
attached :) (though it is untested) let me know what you think.

> last_vma_end = vma->vm_end;
>
> /*
> * Release mmap_lock temporarily if someone wants to
> - * access it for write request.
> + * take it for write request.

I think this is better rewritten to reflect the changes. E.g.:

/*
* If the VMA lock is not taken, we hold the often contended
* mmap lock. This can be because the arch doesn't support VMA
* locks,or we had to fall back to the mmap lock.
*
* To relieve pressure, check if it is indeed contended, then
* temporarily release it.
*/

> */
> - if (mmap_lock_is_contended(mm)) {
> - vma_iter_invalidate(&vmi);
> - unlock_ctx_mm(&priv->lock_ctx);
> - ret = lock_ctx_mm(&priv->lock_ctx);
> - if (ret) {
> - release_task_mempolicy(priv);
> + if (is_mmap_lock_contended(priv)) {

We have both mm and lock_ctx already.

So we could instead do:

if (lock_ctx->mmap_locked && mmap_lock_is_contended(mm)) {
...

Right?

> + unlock_vma_range(&priv->lock_ctx);

OK this confuses me - we're gated on lock_ctx->mmap_locked, so this is exactly
the same as unlock_ctx_mm(lock_ctx) right?

There's no situation here where the VMA lock would be unlocked.

Surely it'd therefore be clearer to just call unlock_ctx_mm(lock_ctx)
direct?

> + ret = lock_vma_range(m, lock_ctx);
> + if (ret)
> goto out_put_mm;

In the case of VMA locks being supported, but we fell back to mmap locks, this
seems to just be getting us back to the invariant that the RCU read lock is
held.

However, it's a bit confusing given we're explicitly checking for the
mmap_locked case, so I think it's worth a comment explaining that we might have
fallen back to mmap lock, but could still get the VMA lock on the next VMA.

E.g.:

/*
* If we are using VMA locks but fell back to an mmap lock, we may
* be able to VMA lock the next VMA, so reset the lock and try again.
*
* Otherwise, if the arch doesn't support VMA locks, this simply
* retakes the mmap lock.
*/

> - }
> -
> - /*
> - * After dropping the lock, there are four cases to
> - * consider. See the following example for explanation.
> - *
> - * +------+------+-----------+
> - * | VMA1 | VMA2 | VMA3 |
> - * +------+------+-----------+
> - * | | | |
> - * 4k 8k 16k 400k
> - *
> - * Suppose we drop the lock after reading VMA2 due to
> - * contention, then we get:
> - *
> - * last_vma_end = 16k
> - *
> - * 1) VMA2 is freed, but VMA3 exists:
> - *
> - * vma_next(vmi) will return VMA3.
> - * In this case, just continue from VMA3.
> - *
> - * 2) VMA2 still exists:
> - *
> - * vma_next(vmi) will return VMA3.
> - * In this case, just continue from VMA3.
> - *
> - * 3) No more VMAs can be found:
> - *
> - * vma_next(vmi) will return NULL.
> - * No more things to do, just break.
> - *
> - * 4) (last_vma_end - 1) is the middle of a vma (VMA'):
> - *
> - * vma_next(vmi) will return VMA' whose range
> - * contains last_vma_end.
> - * Iterate VMA' from last_vma_end.
> - */
> - vma = vma_next(&vmi);
> - /* Case 3 above */
> - if (!vma)
> - break;
> -
> - /* Case 1 and 2 above */
> - if (vma->vm_start >= last_vma_end) {
> - smap_gather_stats(priv, vma, &mss, 0);
> - last_vma_end = vma->vm_end;
> - continue;
> - }
>
> - /* Case 4 above */
> - if (vma->vm_end > last_vma_end) {
> - smap_gather_stats(priv, vma, &mss, last_vma_end);
> - last_vma_end = vma->vm_end;
> - }

OK so you're rolling all this up into the 'if after retaking mmap_lock'
bit.

I don't love that we're losing this information, but also it's maybe a
little more than we need here and it's making

But perhaps could we transfer this into the commit message as a
later-findable justification?

> + /* Resume from the last position. */
> + pos = last_vma_end;
> + vma_iter_init(&priv->iter, mm, pos);
> }
> - } for_each_vma(vmi, vma);
> + vma = proc_get_vma(m, &pos);
> + }
>
> empty_set:
> show_vma_header_prefix(m, vma_start, last_vma_end, 0, 0, 0, 0);
> @@ -1593,10 +1577,10 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
>
> __show_smap(m, &mss, true);
>
> - release_task_mempolicy(priv);
> - unlock_ctx_mm(&priv->lock_ctx);
> -
> +out_unlock:
> + unlock_vma_range(&priv->lock_ctx);
> out_put_mm:
> + release_task_mempolicy(priv);

Previously this was done under the mmap lock, now it's not. I don't think
this should be an issue but just highlighting.

> mmput(mm);
> out_put_task:
> put_task_struct(priv->task);
> --
> 2.54.0.1032.g2f8565e1d1-goog
>

Cheers, Lorenzo

----8<----