Re: [PATCHv3 4/6] mm/fault: Try to map the entire file folio in finish_fault()

From: D, Suneeth

Date: Wed Oct 29 2025 - 00:53:00 EST



Hi Kiryl Shutsemau,

On 9/23/2025 4:37 PM, Kiryl Shutsemau wrote:
From: Kiryl Shutsemau <kas@xxxxxxxxxx>

The finish_fault() function uses per-page fault for file folios. This
only occurs for file folios smaller than PMD_SIZE.

The comment suggests that this approach prevents RSS inflation.
However, it only prevents RSS accounting. The folio is still mapped to
the process, and the fact that it is mapped by a single PTE does not
affect memory pressure. Additionally, the kernel's ability to map
large folios as PMD if they are large enough does not support this
argument.

When possible, map large folios in one shot. This reduces the number of
minor page faults and allows for TLB coalescing.

Mapping large folios at once will allow the rmap code to mlock the folio
on add, as it can recognize that the folio is fully mapped and mlocking
it is safe.


We run the will-it-scale micro-benchmark as part of our weekly CI for kernel performance regression testing between a stable and an rc kernel. We observed a drastic performance gain on AMD platforms (Turin and Bergamo) when running the will-it-scale-process-page-fault3 variant between kernels v6.17 and v6.18-rc1, in the range of 322-400%. Bisecting landed me on this commit (19773df031bcc67d5caa06bf0ddbbff40174be7a) as the first commit to introduce the gain.

The following are the machine configurations and test parameters used:

Model name: AMD EPYC 128-Core Processor [Bergamo]
Thread(s) per core: 2
Core(s) per socket: 128
Socket(s): 1
Total online memory: 258G

Model name: AMD EPYC 64-Core Processor [Turin]
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
Total online memory: 258G

Test params:

nr_task: [1 8 64 128 192 256]
mode: process
test: page_fault3
kpi: per_process_ops
cpufreq_governor: performance

The following are the stats after bisection:-

KPI               v6.17    %diff   v6.18-rc1   %diff   v6.17-with19773df031
---------------   ------   -----   ---------   -----   --------------------
per_process_ops   936152   +322    3954402     +339    4109353

I also checked the numbers with a kernel built with the patch set [1], which fixes the regression reported in [2], to see whether the gain holds, and indeed it does.

per_process_ops %diff (w.r.t baseline v6.17)
--------------- ----------------------------
v6.17.0-withfixpatch: 3968637 +324
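For what it's worth, the %diff figures in both tables can be reproduced with a quick calculation. This is only a sketch; the rounding convention (round to nearest integer) and the use of v6.17 as the common baseline are my assumptions, inferred from the numbers quoted above:

```python
# Sanity check of the %diff figures, assuming
# %diff = (new - baseline) / baseline * 100, rounded to nearest integer.
BASELINE = 936152  # v6.17 per_process_ops

def pct_gain(new, base=BASELINE):
    """Percentage gain of `new` over `base`, rounded to an integer."""
    return round((new - base) / base * 100)

print(pct_gain(3954402))  # v6.18-rc1            -> 322
print(pct_gain(4109353))  # v6.17-with19773df031 -> 339
print(pct_gain(3968637))  # v6.17.0-withfixpatch -> 324
```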

[1] http://lore.kernel.org/all/20251020163054.1063646-1-kirill@xxxxxxxxxxxxx/
[2] https://lore.kernel.org/all/20251014175214.GW6188@frogsfrogsfrogs/

Recreation steps:

1) git clone https://github.com/antonblanchard/will-it-scale.git
2) git clone https://github.com/intel/lkp-tests.git
3) cd will-it-scale && git apply lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
4) make
5) python3 runtest.py page_fault3 25 process 0 0 1 8 64 128 192 256

NOTE: the arguments to step 5 are specific to the machine's architecture. The values starting from 1 are the list of task counts to run the test case with (here: the number of cores per CCX, per NUMA node, per socket, and nr_threads).

Signed-off-by: Kiryl Shutsemau <kas@xxxxxxxxxx>
Reviewed-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
Reviewed-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
---
mm/memory.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..812a7d9f6531 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5386,13 +5386,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
nr_pages = folio_nr_pages(folio);
- /*
- * Using per-page fault to maintain the uffd semantics, and same
- * approach also applies to non shmem/tmpfs faults to avoid
- * inflating the RSS of the process.
- */
- if (!vma_is_shmem(vma) || unlikely(userfaultfd_armed(vma)) ||
- unlikely(needs_fallback)) {
+ /* Using per-page fault to maintain the uffd semantics */
+ if (unlikely(userfaultfd_armed(vma)) || unlikely(needs_fallback)) {
nr_pages = 1;
} else if (nr_pages > 1) {
pgoff_t idx = folio_page_idx(folio, page);

Thanks and Regards,
Suneeth D