Re: [PATCH 0/3] further damage-control lack of clone scalability

From: Matthew Wilcox

Date: Sun Nov 23 2025 - 16:45:24 EST


On Sun, Nov 23, 2025 at 05:39:16PM +0100, Mateusz Guzik wrote:
> I have some recollection we talked about this on irc a long time ago.
>
> It is my *suspicion* this would be best served with a sparse bitmap +
> a hash table.

Maybe! I've heard other people speculate that would be a better data
structure. I know we switched away from a hash table for the page
cache, but that has a different usage pattern where it's common to go
from page N to page N+1, N+2, ... Other than ps, I don't think we often
have that pattern for PIDs.

> Such a solution was already present, but it got replaced by
> 95846ecf9dac5089 ("pid: replace pid bitmap implementation with IDR
> API").
>
> Commit message cites the following bench results:
> The following are the stats for ps, pstree and calling readdir on /proc
> for 10,000 processes.
>
> ps:
>        With IDR API   With bitmap
> real   0m1.479s       0m2.319s
> user   0m0.070s       0m0.060s
> sys    0m0.289s       0m0.516s
>
> pstree:
>        With IDR API   With bitmap
> real   0m1.024s       0m1.794s
> user   0m0.348s       0m0.612s
> sys    0m0.184s       0m0.264s
>
> proc:
>        With IDR API   With bitmap
> real   0m0.059s       0m0.074s
> user   0m0.000s       0m0.004s
> sys    0m0.016s       0m0.016s
>
> Impact on clone was not benchmarked afaics.

It shouldn't be too much effort for you to check out 95846ecf9dac5089
and its parent 95846ecf9dac5089^ and run your clone benchmark on both.
That would seem like the cheapest way of assessing the performance of
hash+bitmap vs IDR.

> Regardless, in order to give whatever replacement a fair perf eval
> against idr, at least the following 2 bits need to get sorted out:
> - the self-induced repeat locking of pidmap_lock
> - high cost of kmalloc (to my understanding waiting for sheaves4all)

The nice thing about XArray (compared to IDR) is that there's no
requirement to preallocate: only 1.6% of xa_alloc() calls end up
calling into slab. The downside is that XArray then needs to know
where its lock is (ie xa_lock) so that it can drop the lock in order to
allocate without using GFP_ATOMIC.

At one point I had a vague plan to create a multi-xarray, where
multiple xarrays share a single lock. Or maybe this sharding is
exactly what's needed; I haven't really analysed the pid locking to
see which would help.