Re: [linus:master] [mm/rmap] 6af8cb80d3: vm-scalability.throughput 7.8% regression

From: David Hildenbrand
Date: Wed Apr 16 2025 - 05:17:00 EST


On 16.04.25 10:07, David Hildenbrand wrote:
> On 16.04.25 09:01, kernel test robot wrote:
>>
>> Hello,
>>
>> kernel test robot noticed a 7.8% regression of vm-scalability.throughput on:
>>
>> commit: 6af8cb80d3a9a6bbd521d8a7c949b4eafb7dba5d ("mm/rmap: basic MM owner tracking for large folios (!hugetlb)")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> testcase: vm-scalability
>> config: x86_64-rhel-9.4
>> compiler: gcc-12
>> test machine: 256 threads 2 sockets GENUINE INTEL(R) XEON(R) (Sierra Forest) with 128G memory
>> parameters:
>>
>>   runtime: 300s
>>   size: 8T
>>   test: anon-cow-seq
>>   cpufreq_governor: performance


> This should be the scenario with THP enabled. At first, I thought the
> problem would be contention on the per-folio spinlock, but what makes
> me scratch my head is the following:
>
>     13401   -16.5%     11190   proc-vmstat.thp_fault_alloc
>   ...
>   3430623   -16.5%   2864565   proc-vmstat.thp_split_pmd
>
> If we allocate fewer THPs, performance of the benchmark will obviously
> be worse.
>
> We allocated 2211 fewer THPs and had 566058 fewer THP PMD->PTE
> remappings.
>
> 566058 / 2211 is almost exactly 256, which is the number of threads,
> i.e., the number of child processes vm-scalability fork'ed.
>
> So it was in fact the benchmark that was effectively using 16.5% fewer
> THPs.
>
> I don't see how this patch would affect the allocation of THPs in any
> way (and I don't think it does).

Thinking about this some more: assuming both runs perform the same number of test executions, we would expect the number of allocated THPs not to change (unless we really have fragmentation that results in fewer THPs getting allocated).

Assuming we run into the 300s timeout and abort the test early, we could end up with a difference in executions and, therefore, in THP allocations.

I recall that we usually try to have the same number of benchmark executions and not run into the timeout (otherwise some of these stats, like THP allocations, are completely unreliable).

Maybe

   7.968e+09   -16.5%   6.652e+09   vm-scalability.workload

indicates that we ended up with fewer executions? At least the "repro-script" seems to indicate that we always execute a fixed number of executions, but maybe the repro-script is aborted by the benchmark framework.
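
A quick consistency check supports that theory: the workload, the THP
allocations, and the PMD splits all dropped by the same fraction, which
is what we'd expect if the runs were simply cut short rather than
behaving differently per execution. Again in Python, with the
before/after values from the report:

# If the run is truncated, every per-run counter should shrink by the
# same factor.
counters = {
    "vm-scalability.workload":     (7.968e9, 6.652e9),
    "proc-vmstat.thp_fault_alloc": (13401, 11190),
    "proc-vmstat.thp_split_pmd":   (3430623, 2864565),
}
for name, (before, after) in counters.items():
    print(f"{name}: {(before - after) / before:.1%} drop")  # all ~16.5%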

--
Cheers,

David / dhildenb