Re: [PATCH v3 26/30] maple_tree: Use maple copy node for mas_wr_split()
From: D, Suneeth
Date: Tue May 12 2026 - 02:42:36 EST
On 5/9/2026 2:48 AM, Liam R. Howlett wrote:
On 26/05/08 02:12PM, D, Suneeth wrote:
Hi Liam Howlett,
On 1/31/2026 2:29 AM, Liam R. Howlett wrote:
Instead of using the maple big node, use the maple copy node for reduced
stack usage and aligning with mas_wr_rebalance() and
mas_wr_spanning_store().
Splitting a node is similar to rebalancing, but a new evaluation of when
to ascend is needed. The only other difference is that the data is
pushed and never rebalanced at each level.
The testing must also align with the changes to this commit to ensure
the test suite continues to pass.
We run the will-it-scale micro-benchmark as part of our weekly CI for
kernel performance regression testing, comparing a stable kernel against
an rc kernel. We observed the will-it-scale-thread-brk1 variant
regressing by ~9% on an AMD Turin machine between kernels v7.0 and
v7.1-rc1. Bisecting further landed me on this commit,
280b792cac62ddadca2935766ca870b438c86323 (maple_tree: Use maple copy node
for mas_wr_split()), as the first bad commit. The machine configuration
and test parameters used were:
Model name: AMD EPYC 64-Core Processor [Turin]
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Total online memory: 258G
Test params:
------------
nr_task: [1 8 64 128 192 256]
mode: thread
test: brk1
kpi: per_thread_ops
cpufreq_governor: performance
The following are the stats after bisection
(the KPI used here is per_thread_ops):
v7.0 (baseline)  %diff  per_thread_ops  kernel_rc_ver
---------------  -----  --------------  -------------
353091           -9     321987          v7.1-rc1
353091           -7     328897          v7.0-rc5-280b792cac62(culprit)
353091           -1     347884          v7.0-rc5-11e7f22f5e85(culpritm1)
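The %diff column can be reproduced from the raw numbers. A minimal
sketch in Python, assuming %diff is the relative change against the v7.0
baseline, rounded to the nearest integer (the helper name pct_diff is
hypothetical and not part of will-it-scale or lkp-tests):

```python
def pct_diff(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


# Values copied from the thread-mode table above (v7.0 baseline).
BASELINE = 353091
RESULTS = {
    "v7.1-rc1": 321987,
    "v7.0-rc5-280b792cac62": 328897,
    "v7.0-rc5-11e7f22f5e85": 347884,
}

for kernel, ops in RESULTS.items():
    # Print the unrounded delta; the table shows it rounded.
    print(f"{kernel}: {pct_diff(BASELINE, ops):+.2f}%")
```

Rounding the three results gives -9, -7 and -1, matching the table.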
Just FYI, a very high-level call trace from running
will-it-scale-thread-brk1 that ends up in mas_wr_split() looks like this:
do_brk_flags() {
  may_expand_vm();
  vma_merge_new_range() {
    vma_expand() {
      commit_merge() {
        vma_iter_store_overwrite() {
          mas_store_prealloc() {
            mas_wr_store_entry() {
              mas_wr_split(); <--- Function of interest from this patch
            } /* mas_wr_store_entry */
          } /* mas_store_prealloc */
        } /* vma_iter_store_overwrite */
      } /* commit_merge */
    } /* vma_expand */
  } /* vma_merge_new_range */
} /* do_brk_flags */
Recreation steps:
-----------------
1) git clone https://github.com/antonblanchard/will-it-scale.git
2) git clone https://github.com/intel/lkp-tests.git
3) cd will-it-scale && git apply \
   ../lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
4) make
5) python3 ./runtest.py brk1 25 thread 0 0 1 8 64 128 192 256
NOTE: The task counts in step 5 are specific to the machine's
architecture. The list starting at 1 is the set of task counts to run
the testcase with, which here follows the number of cores per CCX, per
NUMA node/per socket, and nr_threads.
Would be happy to help with further testing and providing additional
data if required.
Thank you for this report.
Considering this is brk1() in thread mode, I'm going to tell you that
this test is seriously flawed and will not produce anything that looks
reasonable. The way it is written will race all over the place and thus
is unreliable.
Thank you Liam and Matthew for your candid feedback. You're right that I
should have reasoned about what brk1 in thread mode is actually measuring
before treating the bisected delta as a real regression.
Does the same test in processes show a regression?
No, the same test in process mode does not show a significant regression.
v7.0 (baseline)  %diff  per_process_ops  kernel_rc_ver
---------------  -----  ---------------  -------------
1050189          -2     1027859          v7.1-rc1
Apologies for the noise, and thanks again for your time.
Thanks,
Liam
Thanks & Regards,
Suneeth D