[PATCH 0/9] Mitigate a vmap lock contention

From: Uladzislau Rezki (Sony)
Date: Mon May 22 2023 - 07:14:34 EST


Hello, folks.

1. This is a follow-up to the vmap topic that was highlighted at the LSFMMBPF-2023
conference. This small series attempts to mitigate the lock contention across the
vmap/vmalloc code. The problem is described here:

wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

The material is tagged as a v2 version. It contains extra slides about throughput
testing, the test steps and a comparison with the current approach.

2. Motivation.

- The vmap code does not scale with the number of CPUs and this should be fixed;
- XFS folks have complained several times that vmalloc can be contended on
their workloads:

<snip>
commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner <dchinner@xxxxxxxxxx>
Date: Tue Jan 4 17:22:18 2022 -0800

xfs: reduce kvmalloc overhead for CIL shadow buffers

Oh, let me count the ways that the kvmalloc API sucks dog eggs.

The problem is when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly order allocations, and
behaviour utterly sucks:
...
<snip>

- If we can match the per-cpu KVA allocator in terms of performance, we can
remove it to simplify the vmap code. Currently we have three allocators.
See:

<snip>
/*** Per cpu kva allocator ***/

/*
* vmap space is limited especially on 32 bit architectures. Ensure there is
* room for at least 16 percpu vmap blocks per CPU.
*/
/*
* If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
* to #define VMALLOC_SPACE (VMALLOC_END-VMALLOC_START). Guess
* instead (we just need a rough idea)
*/
<snip>
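
As a rough illustration of the direction this series takes (a hypothetical,
simplified sketch only; the names below are made up and do not match the actual
patches): instead of serializing everything on the global free_vmap_area_lock,
each CPU gets its own zone with its own lock and its own free/busy/lazy
structures, so allocations running on different CPUs do not contend on a single
spinlock:

<snip>
#include <linux/spinlock.h>
#include <linux/rbtree.h>
#include <linux/list.h>
#include <linux/percpu-defs.h>

/*
 * Hypothetical per-CPU zone. Splitting the global trees/lists this way
 * is what would allow the allocation fast path to take only a per-CPU
 * lock instead of the global one.
 */
struct cpu_vmap_zone {
        spinlock_t              lock;           /* protects this zone only */
        struct rb_root          free_root;      /* free vmap areas of this zone */
        struct rb_root          busy_root;      /* busy (allocated) vmap areas */
        struct list_head        purge_list;     /* lazily freed vmap areas */
};

static DEFINE_PER_CPU(struct cpu_vmap_zone, cpu_vmap_zone);
<snip>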

3. Test

On my AMD Ryzen Threadripper 3970X 32-Core Processor I have the figures below:

jobs  1-page     1-page-this-patch
1 0.576131 vs 0.555889
2 2.68376 vs 1.07895
3 4.26502 vs 1.01739
4 6.04306 vs 1.28924
5 8.04786 vs 1.57616
6 9.38844 vs 1.78142
7 9.53481 vs 2.00172
8 10.4609 vs 2.15964
9 10.6089 vs 2.484
10 11.7372 vs 2.40443
11 11.5969 vs 2.71635
12 13.053 vs 2.6162
13 12.2973 vs 2.843
14 13.1429 vs 2.85714
15 13.7348 vs 2.90691
16 14.3687 vs 3.0285
17 14.8823 vs 3.05717
18 14.9571 vs 2.98018
19 14.9127 vs 3.0951
20 16.0786 vs 3.19521
21 15.8055 vs 3.24915
22 16.8087 vs 3.2521
23 16.7298 vs 3.25698
24 17.244 vs 3.36586
25 17.8416 vs 3.39384
26 18.8008 vs 3.40907
27 18.5591 vs 3.5254
28 19.761 vs 3.55468
29 20.06 vs 3.59869
30 20.4353 vs 3.6991
31 20.9082 vs 3.73028
32 21.0865 vs 3.82904

1..32 is the number of jobs. The results are in usec and represent the
vmalloc()/vfree() pair throughput.
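
For reference, below is a minimal, hypothetical sketch of this kind of
measurement (it is not the exact benchmark used for the table above): each
kernel thread runs a fixed number of 1-page vmalloc()/vfree() pairs and
reports the elapsed time in microseconds.

<snip>
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/vmalloc.h>
#include <linux/slab.h>
#include <linux/ktime.h>
#include <linux/mm.h>
#include <linux/err.h>

/* Hypothetical sketch only, not the benchmark used for the table above. */
#define NR_PAIRS        1000000

static int nr_threads = 1;              /* "jobs" column in the table */
module_param(nr_threads, int, 0444);

static struct task_struct **tasks;

static int worker(void *unused)
{
        ktime_t start = ktime_get();
        int i;

        /* One job: a fixed number of 1-page vmalloc()/vfree() pairs. */
        for (i = 0; i < NR_PAIRS; i++) {
                void *p = vmalloc(PAGE_SIZE);

                if (!p)
                        break;
                vfree(p);
        }

        pr_info("vmalloc_bench: %d pairs took %lld usec\n", i,
                ktime_to_us(ktime_sub(ktime_get(), start)));
        return 0;
}

static int __init vmalloc_bench_init(void)
{
        int i;

        tasks = kcalloc(nr_threads, sizeof(*tasks), GFP_KERNEL);
        if (!tasks)
                return -ENOMEM;

        for (i = 0; i < nr_threads; i++)
                tasks[i] = kthread_run(worker, NULL, "vmalloc_bench/%d", i);

        return 0;
}

static void __exit vmalloc_bench_exit(void)
{
        int i;

        /* kthread_stop() also reaps threads that have already finished. */
        for (i = 0; i < nr_threads; i++)
                if (!IS_ERR_OR_NULL(tasks[i]))
                        kthread_stop(tasks[i]);

        kfree(tasks);
}

module_init(vmalloc_bench_init);
module_exit(vmalloc_bench_exit);
MODULE_LICENSE("GPL");
<snip>

Loading such a module with nr_threads set to 1..32 would give one data point
per row of the table above.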

The series is based on the v6.3 tag and is considered a beta version. Please
note that it does not support the vread() functionality yet, so it is not
fully complete.

Any input/thoughts are welcome.

Uladzislau Rezki (Sony) (9):
mm: vmalloc: Add va_alloc() helper
mm: vmalloc: Rename adjust_va_to_fit_type() function
mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
mm: vmalloc: Add a per-CPU-zone infrastructure
mm: vmalloc: Insert busy-VA per-cpu zone
mm: vmalloc: Support multiple zones in vmallocinfo
mm: vmalloc: Insert lazy-VA per-cpu zone
mm: vmalloc: Offload free_vmap_area_lock global lock
mm: vmalloc: Scale and activate cvz_size

mm/vmalloc.c | 641 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 448 insertions(+), 193 deletions(-)

--
2.30.2