Re: [PATCH mm-unstable v3 6/6] mm/mglru: rework workingset protection
From: Oliver Sang
Date: Wed Dec 25 2024 - 21:51:43 EST
hi, Yu Zhao,
On Tue, Dec 24, 2024 at 12:04:44PM -0700, Yu Zhao wrote:
> On Mon, Dec 23, 2024 at 04:44:44PM +0800, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a 5.7% regression of will-it-scale.per_process_ops on:
>
> Thanks, Oliver!
>
> > commit: 3b7734aa8458b62ecbfd785ca7918e831565006e ("[PATCH mm-unstable v3 6/6] mm/mglru: rework workingset protection")
> > url: https://github.com/intel-lab-lkp/linux/commits/Yu-Zhao/mm-mglru-clean-up-workingset/20241208-061714
> > base: v6.13-rc1
> > patch link: https://lore.kernel.org/all/20241207221522.2250311-7-yuzhao@xxxxxxxxxx/
> > patch subject: [PATCH mm-unstable v3 6/6] mm/mglru: rework workingset protection
> >
> > testcase: will-it-scale
> > config: x86_64-rhel-9.4
> > compiler: gcc-12
> > test machine: 104 threads 2 sockets (Skylake) with 192G memory
> > parameters:
> >
> > nr_task: 100%
> > mode: process
> > test: pread2
> > cpufreq_governor: performance
>
> I think this is very likely caused by my change to folio_mark_accessed()
> that unnecessarily dirties cache lines shared between different cores.
>
> Could you try the following fix please?
Yes, this patch fully recovers the performance (see (1) below). Thanks!
Tested-by: kernel test robot <oliver.sang@xxxxxxxxx>
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-9.4/process/100%/debian-12-x86_64-20240206.cgz/lkp-skl-fpga01/pread2/will-it-scale
commit:
4a202aca7c ("mm/mglru: rework refault detection")
3b7734aa84 ("mm/mglru: rework workingset protection")
c5346da9fe <-- fix patch from you
4a202aca7c7d9f99 3b7734aa8458b62ecbfd785ca79 c5346da9fe00d3b303057d93fd9
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
1.03 ± 3% -0.1 0.92 ± 5% -0.0 0.98 ± 6% mpstat.cpu.all.usr%
144371 -0.5% 143667 ± 2% -2.0% 141486 vmstat.system.in
335982 -60.7% 132060 ± 15% -61.7% 128640 ± 14% proc-vmstat.nr_active_anon
335982 -60.7% 132060 ± 15% -61.7% 128640 ± 14% proc-vmstat.nr_zone_active_anon
1343709 -60.7% 528460 ± 15% -61.7% 514494 ± 14% meminfo.Active
1343709 -60.7% 528460 ± 15% -61.7% 514494 ± 14% meminfo.Active(anon)
259.96 +3.2e+05% 821511 ± 11% +3.2e+05% 829732 ± 9% meminfo.Inactive
1401961 -5.7% 1321692 ± 2% -0.1% 1399905 will-it-scale.104.processes
13479 -5.7% 12708 ± 2% -0.1% 13460 will-it-scale.per_process_ops <----- (1)
1401961 -5.7% 1321692 ± 2% -0.1% 1399905 will-it-scale.workload
138691 ± 43% -75.8% 33574 ± 55% -54.9% 62588 ± 61% numa-vmstat.node0.nr_active_anon
138691 ± 43% -75.8% 33574 ± 55% -54.9% 62588 ± 61% numa-vmstat.node0.nr_zone_active_anon
197311 ± 30% -50.1% 98494 ± 18% -66.5% 66034 ± 50% numa-vmstat.node1.nr_active_anon
197311 ± 30% -50.1% 98494 ± 18% -66.5% 66034 ± 50% numa-vmstat.node1.nr_zone_active_anon
0.29 ± 14% +20.8% 0.35 ± 7% -14.6% 0.25 ± 31% perf-sched.sch_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
1.02 ± 21% +50.7% 1.54 ± 23% -10.2% 0.92 ± 19% perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
476.63 ± 10% -12.7% 415.87 ± 28% -31.2% 327.79 ± 35% perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
476.50 ± 10% -12.7% 415.80 ± 28% -31.2% 327.69 ± 35% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
554600 ± 43% -75.8% 134360 ± 55% -54.8% 250416 ± 61% numa-meminfo.node0.Active
554600 ± 43% -75.8% 134360 ± 55% -54.8% 250416 ± 61% numa-meminfo.node0.Active(anon)
173.31 ± 70% +1.4e+05% 247821 ± 50% +1.9e+05% 338038 ± 45% numa-meminfo.node0.Inactive
789291 ± 30% -50.1% 394029 ± 18% -66.5% 264180 ± 50% numa-meminfo.node1.Active
789291 ± 30% -50.1% 394029 ± 18% -66.5% 264180 ± 50% numa-meminfo.node1.Active(anon)
86.66 ±141% +6.6e+05% 573998 ± 27% +5.7e+05% 491639 ± 33% numa-meminfo.node1.Inactive
2.657e+09 -2.2% 2.598e+09 ± 2% -2.4% 2.592e+09 ± 2% perf-stat.i.branch-instructions
1.156e+10 -2.3% 1.13e+10 ± 2% -2.5% 1.127e+10 ± 2% perf-stat.i.instructions
0.01 ± 50% -66.9% 0.00 ± 82% -72.9% 0.00 ±110% perf-stat.i.major-faults
2.648e+09 -18.7% 2.152e+09 ± 44% -2.4% 2.584e+09 ± 2% perf-stat.ps.branch-instructions
1.152e+10 -18.8% 9.358e+09 ± 44% -2.5% 1.123e+10 ± 2% perf-stat.ps.instructions
0.01 ± 50% -73.6% 0.00 ±112% -72.8% 0.00 ±110% perf-stat.ps.major-faults
38.95 -0.9 38.09 +0.0 38.96 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.folio_wait_bit_common.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read
38.83 -0.9 37.97 +0.0 38.84 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.folio_wait_bit_common.shmem_get_folio_gfp.shmem_file_read_iter
39.70 -0.8 38.86 +0.0 39.71 perf-profile.calltrace.cycles-pp.folio_wait_bit_common.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read.__x64_sys_pread64
41.03 -0.8 40.26 +0.0 41.04 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read.__x64_sys_pread64.do_syscall_64
0.91 +0.0 0.95 -0.0 0.91 ± 2% perf-profile.calltrace.cycles-pp.filemap_get_entry.shmem_get_folio_gfp.shmem_file_read_iter.vfs_read.__x64_sys_pread64
53.14 +0.5 53.66 -0.0 53.13 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_wake_bit.shmem_file_read_iter.vfs_read
53.24 +0.5 53.76 -0.0 53.23 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_wake_bit.shmem_file_read_iter.vfs_read.__x64_sys_pread64
53.84 +0.5 54.38 -0.0 53.82 perf-profile.calltrace.cycles-pp.folio_wake_bit.shmem_file_read_iter.vfs_read.__x64_sys_pread64.do_syscall_64
38.96 -0.9 38.09 +0.0 38.96 perf-profile.children.cycles-pp._raw_spin_lock_irq
39.71 -0.8 38.87 +0.0 39.72 perf-profile.children.cycles-pp.folio_wait_bit_common
41.04 -0.8 40.26 +0.0 41.05 perf-profile.children.cycles-pp.shmem_get_folio_gfp
92.00 -0.3 91.67 -0.0 92.00 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
0.22 -0.0 0.18 ± 3% -0.0 0.22 ± 3% perf-profile.children.cycles-pp._copy_to_iter
0.22 ± 2% -0.0 0.19 ± 2% -0.0 0.22 ± 2% perf-profile.children.cycles-pp.copy_page_to_iter
0.20 ± 2% -0.0 0.16 ± 4% -0.0 0.19 ± 2% perf-profile.children.cycles-pp.rep_movs_alternative
0.91 +0.0 0.96 -0.0 0.91 ± 2% perf-profile.children.cycles-pp.filemap_get_entry
0.00 +0.3 0.35 +0.0 0.01 ±299% perf-profile.children.cycles-pp.folio_mark_accessed
53.27 +0.5 53.80 -0.0 53.26 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
53.86 +0.5 54.40 -0.0 53.84 perf-profile.children.cycles-pp.folio_wake_bit
92.00 -0.3 91.67 -0.0 92.00 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
0.19 -0.0 0.16 ± 3% +0.0 0.19 ± 2% perf-profile.self.cycles-pp.rep_movs_alternative
0.41 +0.0 0.44 +0.0 0.41 ± 3% perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.37 ± 2% +0.0 0.40 +0.0 0.38 ± 2% perf-profile.self.cycles-pp.folio_wait_bit_common
0.90 +0.0 0.94 -0.0 0.90 ± 2% perf-profile.self.cycles-pp.filemap_get_entry
0.61 +0.1 0.68 +0.0 0.61 ± 2% perf-profile.self.cycles-pp.shmem_file_read_iter
0.00 +0.3 0.34 ± 2% +0.0 0.00 perf-profile.self.cycles-pp.folio_mark_accessed
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 062c8565b899..54bce14fef30 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -395,7 +395,8 @@ static void lru_gen_inc_refs(struct folio *folio)
>
> do {
> if ((old_flags & LRU_REFS_MASK) == LRU_REFS_MASK) {
> - folio_set_workingset(folio);
> + if (!folio_test_workingset(folio))
> + folio_set_workingset(folio);
> return;
> }
>
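For illustration only, here is a minimal userspace sketch of the test-before-set pattern the fix above relies on. It is not kernel code: FLAG_WORKINGSET, set_flag() and set_flag_if_clear() are hypothetical names, and C11 atomics stand in for the kernel's folio flag helpers. The assumption is that an atomic read-modify-write pulls the cache line into exclusive state even when the bit is already set, while a plain load lets the line stay shared between cores.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit for this sketch; not the kernel's PG_workingset. */
#define FLAG_WORKINGSET (1UL << 0)

/*
 * Unconditional set: always issues an atomic read-modify-write,
 * so the cache line is written even if the bit is already set.
 */
static void set_flag(atomic_ulong *flags)
{
	atomic_fetch_or(flags, FLAG_WORKINGSET);
}

/*
 * Test before set: a relaxed load first, and the atomic OR only
 * when the bit is clear. Returns true if a write was issued.
 */
static bool set_flag_if_clear(atomic_ulong *flags)
{
	if (atomic_load_explicit(flags, memory_order_relaxed) & FLAG_WORKINGSET)
		return false;	/* already set: read-only path */

	atomic_fetch_or(flags, FLAG_WORKINGSET);
	return true;
}
```

In a read-heavy loop such as pread2, where every process touches the same folio, only the first call to set_flag_if_clear() writes; later calls take the read-only path.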