Re: [linus:master] [mm/slub] 306c4ac989: stress-ng.seal.ops_per_sec 5.2% improvement

From: Vlastimil Babka
Date: Thu Jul 25 2024 - 06:11:56 EST


On 7/25/24 10:04 AM, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a 5.2% improvement of stress-ng.seal.ops_per_sec on:
>
>
> commit: 306c4ac9896b07b8872293eb224058ff83f81fac ("mm/slub: create kmalloc 96 and 192 caches regardless cache size order")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

Well, that's great news, but it's highly unlikely that this commit caused
such an improvement, as it only optimizes create_kmalloc_caches(), which
runs once per boot. There may be secondary effects from the different order
of slab cache creation, e.g. a different CPU cache layout, but such an
improvement would likely be machine- and compiler-specific, and overall fragile.
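
For context on what the "kmalloc 96 and 192 caches" are, here is a minimal
user-space sketch of how a request size maps to a kmalloc cache index. It is
a simplified model based on my reading of __kmalloc_index() in
include/linux/slab.h, not the kernel code itself, and it does not model the
commit. The names model_kmalloc_index(), MODEL_KMALLOC_MIN_SIZE and
MODEL_KMALLOC_SHIFT_LOW are hypothetical, and the minimum size of 8 bytes is
an assumption that depends on architecture and config.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: smallest kmalloc cache is 8 bytes (2^3), as is common on x86_64. */
#define MODEL_KMALLOC_MIN_SIZE  8u
#define MODEL_KMALLOC_SHIFT_LOW 3u

/*
 * Simplified model of the size -> cache index mapping.
 * Indices 1 and 2 are reserved for the two non-power-of-two caches
 * (96 and 192 bytes); every other index i serves sizes up to 2^i.
 */
unsigned int model_kmalloc_index(size_t size)
{
	unsigned int shift;

	/* Model-only bound so the shift loop below cannot overflow. */
	assert(size <= ((size_t)1 << 30));

	if (size == 0)
		return 0;
	if (size <= MODEL_KMALLOC_MIN_SIZE)
		return MODEL_KMALLOC_SHIFT_LOW;

	/* The two special caches sit between the power-of-two sizes. */
	if (size > 64 && size <= 96)
		return 1;
	if (size > 128 && size <= 192)
		return 2;

	/* Otherwise round up to the next power of two. */
	for (shift = MODEL_KMALLOC_SHIFT_LOW; ((size_t)1 << shift) < size; shift++)
		;
	return shift;
}
```

Under these assumptions, model_kmalloc_index(96) gives 1 and
model_kmalloc_index(192) gives 2, while 97 and 128 both land in index 7.
The point is that this per-allocation lookup is separate from the order in
which the caches get created at boot.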

> testcase: stress-ng
> test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 256G memory
> parameters:
>
> nr_threads: 100%
> testtime: 60s
> test: seal
> cpufreq_governor: performance
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20240725/202407251553.12f35198-oliver.sang@xxxxxxxxx
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
> gcc-13/performance/x86_64-rhel-8.3/100%/debian-12-x86_64-20240206.cgz/lkp-spr-r02/seal/stress-ng/60s
>
> commit:
> 844776cb65 ("mm/slub: mark racy access on slab->freelist")
> 306c4ac989 ("mm/slub: create kmalloc 96 and 192 caches regardless cache size order")
>
> 844776cb65a77ef2 306c4ac9896b07b8872293eb224
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 2.51 ± 27% +1.9 4.44 ± 35% mpstat.cpu.all.idle%
> 975100 ± 19% +29.5% 1262643 ± 16% numa-meminfo.node1.AnonPages.max
> 187.06 ± 4% -11.5% 165.63 ± 10% sched_debug.cfs_rq:/.runnable_avg.stddev
> 0.05 ± 18% -40.0% 0.03 ± 58% vmstat.procs.b
> 58973718 +5.2% 62024061 stress-ng.seal.ops
> 982893 +5.2% 1033732 stress-ng.seal.ops_per_sec
> 59045344 +5.2% 62095668 stress-ng.time.minor_page_faults
> 174957 +1.4% 177400 proc-vmstat.nr_slab_unreclaimable
> 63634761 +5.5% 67148443 proc-vmstat.numa_hit
> 63399995 +5.5% 66914221 proc-vmstat.numa_local
> 73601172 +6.1% 78073549 proc-vmstat.pgalloc_normal
> 59870250 +5.3% 63063514 proc-vmstat.pgfault
> 72718474 +6.0% 77106313 proc-vmstat.pgfree
> 1.983e+10 +1.3% 2.01e+10 perf-stat.i.branch-instructions
> 66023349 +5.6% 69728143 perf-stat.i.cache-misses
> 2.023e+08 +4.7% 2.117e+08 perf-stat.i.cache-references
> 7.22 -1.9% 7.08 perf-stat.i.cpi
> 9738 -5.6% 9196 perf-stat.i.cycles-between-cache-misses
> 8.799e+10 +1.6% 8.939e+10 perf-stat.i.instructions
> 0.14 +1.6% 0.14 perf-stat.i.ipc
> 8.71 +5.1% 9.16 perf-stat.i.metric.K/sec
> 983533 +4.7% 1029816 perf-stat.i.minor-faults
> 983533 +4.7% 1029816 perf-stat.i.page-faults
> 7.30 -18.4% 5.96 ± 44% perf-stat.overall.cpi
> 9735 -21.3% 7658 ± 44% perf-stat.overall.cycles-between-cache-misses
> 0.52 +0.1 0.62 ± 7% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.ftruncate64
> 0.56 +0.1 0.67 ± 7% perf-profile.calltrace.cycles-pp.ftruncate64
> 0.34 ± 70% +0.3 0.60 ± 7% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
> 48.29 +0.6 48.86 perf-profile.calltrace.cycles-pp.__close
> 48.27 +0.6 48.84 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
> 48.27 +0.6 48.84 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__close
> 48.26 +0.6 48.83 perf-profile.calltrace.cycles-pp.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
> 0.00 +0.6 0.58 ± 7% perf-profile.calltrace.cycles-pp.__x64_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
> 48.21 +0.6 48.80 perf-profile.calltrace.cycles-pp.__fput.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
> 48.03 +0.6 48.68 perf-profile.calltrace.cycles-pp.dput.__fput.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 48.02 +0.6 48.66 perf-profile.calltrace.cycles-pp.__dentry_kill.dput.__fput.__x64_sys_close.do_syscall_64
> 47.76 +0.7 48.47 perf-profile.calltrace.cycles-pp.evict.__dentry_kill.dput.__fput.__x64_sys_close
> 47.19 +0.7 47.92 perf-profile.calltrace.cycles-pp._raw_spin_lock.evict.__dentry_kill.dput.__fput
> 47.11 +0.8 47.88 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.evict.__dentry_kill.dput
> 0.74 -0.3 0.48 ± 8% perf-profile.children.cycles-pp.__munmap
> 0.69 -0.2 0.44 ± 9% perf-profile.children.cycles-pp.__x64_sys_munmap
> 0.68 -0.2 0.44 ± 9% perf-profile.children.cycles-pp.__vm_munmap
> 0.68 -0.2 0.45 ± 9% perf-profile.children.cycles-pp.do_vmi_munmap
> 0.65 -0.2 0.42 ± 8% perf-profile.children.cycles-pp.do_vmi_align_munmap
> 0.44 -0.2 0.28 ± 7% perf-profile.children.cycles-pp.unmap_region
> 0.48 -0.1 0.36 ± 7% perf-profile.children.cycles-pp.asm_exc_page_fault
> 0.42 -0.1 0.32 ± 7% perf-profile.children.cycles-pp.do_user_addr_fault
> 0.42 ± 2% -0.1 0.32 ± 7% perf-profile.children.cycles-pp.exc_page_fault
> 0.38 ± 2% -0.1 0.29 ± 7% perf-profile.children.cycles-pp.handle_mm_fault
> 0.35 ± 2% -0.1 0.27 ± 7% perf-profile.children.cycles-pp.__handle_mm_fault
> 0.33 ± 2% -0.1 0.26 ± 6% perf-profile.children.cycles-pp.do_fault
> 0.21 ± 2% -0.1 0.14 ± 8% perf-profile.children.cycles-pp.lru_add_drain
> 0.22 -0.1 0.15 ± 11% perf-profile.children.cycles-pp.alloc_inode
> 0.21 ± 2% -0.1 0.15 ± 9% perf-profile.children.cycles-pp.lru_add_drain_cpu
> 0.18 ± 2% -0.1 0.12 ± 8% perf-profile.children.cycles-pp.unmap_vmas
> 0.21 ± 2% -0.1 0.14 ± 7% perf-profile.children.cycles-pp.folio_batch_move_lru
> 0.17 -0.1 0.11 ± 8% perf-profile.children.cycles-pp.unmap_page_range
> 0.16 ± 2% -0.1 0.10 ± 9% perf-profile.children.cycles-pp.zap_pte_range
> 0.16 ± 2% -0.1 0.10 ± 9% perf-profile.children.cycles-pp.zap_pmd_range
> 0.26 ± 2% -0.1 0.20 ± 7% perf-profile.children.cycles-pp.shmem_fault
> 0.50 -0.1 0.45 ± 8% perf-profile.children.cycles-pp.mmap_region
> 0.26 ± 2% -0.1 0.20 ± 7% perf-profile.children.cycles-pp.__do_fault
> 0.26 -0.1 0.21 ± 6% perf-profile.children.cycles-pp.shmem_get_folio_gfp
> 0.19 ± 2% -0.1 0.14 ± 14% perf-profile.children.cycles-pp.write
> 0.22 ± 3% -0.0 0.18 ± 5% perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
> 0.11 ± 4% -0.0 0.07 ± 10% perf-profile.children.cycles-pp.mas_store_gfp
> 0.16 ± 2% -0.0 0.12 ± 11% perf-profile.children.cycles-pp.mas_wr_store_entry
> 0.14 -0.0 0.10 ± 10% perf-profile.children.cycles-pp.mas_wr_node_store
> 0.08 -0.0 0.04 ± 45% perf-profile.children.cycles-pp.msync
> 0.06 -0.0 0.02 ± 99% perf-profile.children.cycles-pp.mas_find
> 0.12 ± 4% -0.0 0.08 ± 11% perf-profile.children.cycles-pp.inode_init_always
> 0.10 ± 3% -0.0 0.07 ± 11% perf-profile.children.cycles-pp.shmem_alloc_inode
> 0.16 -0.0 0.13 ± 9% perf-profile.children.cycles-pp.__x64_sys_fcntl
> 0.11 ± 4% -0.0 0.08 ± 11% perf-profile.children.cycles-pp.shmem_file_write_iter
> 0.10 ± 4% -0.0 0.08 ± 8% perf-profile.children.cycles-pp.do_fcntl
> 0.15 -0.0 0.13 ± 8% perf-profile.children.cycles-pp.destroy_inode
> 0.16 ± 3% -0.0 0.14 ± 7% perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
> 0.22 ± 3% -0.0 0.20 ± 5% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> 0.08 -0.0 0.06 ± 11% perf-profile.children.cycles-pp.___slab_alloc
> 0.15 ± 3% -0.0 0.12 ± 8% perf-profile.children.cycles-pp.__destroy_inode
> 0.07 ± 7% -0.0 0.04 ± 45% perf-profile.children.cycles-pp.__call_rcu_common
> 0.13 ± 2% -0.0 0.11 ± 8% perf-profile.children.cycles-pp.perf_event_mmap
> 0.09 -0.0 0.07 ± 9% perf-profile.children.cycles-pp.memfd_fcntl
> 0.06 -0.0 0.04 ± 44% perf-profile.children.cycles-pp.native_irq_return_iret
> 0.08 ± 6% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.shmem_add_to_page_cache
> 0.12 -0.0 0.10 ± 6% perf-profile.children.cycles-pp.perf_event_mmap_event
> 0.11 ± 3% -0.0 0.09 ± 7% perf-profile.children.cycles-pp.__lruvec_stat_mod_folio
> 0.10 -0.0 0.08 ± 8% perf-profile.children.cycles-pp.uncharge_batch
> 0.12 ± 4% -0.0 0.10 ± 6% perf-profile.children.cycles-pp.entry_SYSCALL_64
> 0.05 +0.0 0.07 ± 5% perf-profile.children.cycles-pp.__d_alloc
> 0.05 +0.0 0.07 ± 10% perf-profile.children.cycles-pp.d_alloc_pseudo
> 0.07 +0.0 0.09 ± 7% perf-profile.children.cycles-pp.file_init_path
> 0.06 ± 6% +0.0 0.08 ± 8% perf-profile.children.cycles-pp.security_file_alloc
> 0.07 ± 7% +0.0 0.09 ± 7% perf-profile.children.cycles-pp.errseq_sample
> 0.04 ± 44% +0.0 0.07 ± 10% perf-profile.children.cycles-pp.apparmor_file_alloc_security
> 0.09 +0.0 0.12 ± 5% perf-profile.children.cycles-pp.init_file
> 0.15 +0.0 0.18 ± 7% perf-profile.children.cycles-pp.common_perm_cond
> 0.15 ± 3% +0.0 0.19 ± 8% perf-profile.children.cycles-pp.security_file_truncate
> 0.20 +0.0 0.24 ± 7% perf-profile.children.cycles-pp.notify_change
> 0.06 +0.0 0.10 ± 6% perf-profile.children.cycles-pp.inode_init_owner
> 0.13 +0.0 0.18 ± 5% perf-profile.children.cycles-pp.alloc_empty_file
> 0.10 +0.1 0.16 ± 7% perf-profile.children.cycles-pp.clear_nlink
> 0.47 +0.1 0.56 ± 7% perf-profile.children.cycles-pp.do_ftruncate
> 0.49 +0.1 0.59 ± 7% perf-profile.children.cycles-pp.__x64_sys_ftruncate
> 0.59 +0.1 0.70 ± 7% perf-profile.children.cycles-pp.ftruncate64
> 0.28 +0.1 0.40 ± 6% perf-profile.children.cycles-pp.alloc_file_pseudo
> 98.62 +0.2 98.77 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
> 98.58 +0.2 98.74 perf-profile.children.cycles-pp.do_syscall_64
> 48.30 +0.6 48.86 perf-profile.children.cycles-pp.__close
> 48.26 +0.6 48.83 perf-profile.children.cycles-pp.__x64_sys_close
> 48.21 +0.6 48.80 perf-profile.children.cycles-pp.__fput
> 48.04 +0.6 48.68 perf-profile.children.cycles-pp.dput
> 48.02 +0.6 48.67 perf-profile.children.cycles-pp.__dentry_kill
> 47.77 +0.7 48.47 perf-profile.children.cycles-pp.evict
> 0.30 -0.1 0.23 ± 7% perf-profile.self.cycles-pp._raw_spin_lock
> 0.10 ± 4% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.__fput
> 0.08 ± 6% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.inode_init_always
> 0.06 -0.0 0.04 ± 44% perf-profile.self.cycles-pp.native_irq_return_iret
> 0.08 -0.0 0.06 ± 7% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
> 0.09 -0.0 0.08 ± 4% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
> 0.07 +0.0 0.09 ± 7% perf-profile.self.cycles-pp.__shmem_get_inode
> 0.06 ± 7% +0.0 0.09 ± 9% perf-profile.self.cycles-pp.errseq_sample
> 0.15 ± 2% +0.0 0.18 ± 7% perf-profile.self.cycles-pp.common_perm_cond
> 0.03 ± 70% +0.0 0.06 ± 7% perf-profile.self.cycles-pp.apparmor_file_alloc_security
> 0.06 +0.0 0.10 ± 7% perf-profile.self.cycles-pp.inode_init_owner
> 0.10 +0.1 0.16 ± 6% perf-profile.self.cycles-pp.clear_nlink
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
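
As a sanity check on the headline number, the %change column for
stress-ng.seal.ops_per_sec can be re-derived from the two values in the
quoted table. This is only a sketch of the arithmetic: percent_change() is a
hypothetical helper, and I'm assuming the robot reports a plain relative
change rounded to one decimal place.

```c
#include <assert.h>

/*
 * Relative change from base to patched, in percent.
 * base must be positive; a zero baseline has no meaningful relative change.
 */
double percent_change(double base, double patched)
{
	assert(base > 0.0);
	return 100.0 * (patched - base) / base;
}
```

With the values from the table, percent_change(982893.0, 1033732.0) comes
out at roughly 5.17, which matches the reported +5.2% after rounding. It says
nothing about whether the difference is caused by the commit.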