Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

From: Oliver Sang
Date: Mon Nov 27 2023 - 01:59:13 EST


Hi Linus,

On Sun, Nov 26, 2023 at 03:20:58PM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 12:23, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > IOW, I might have messed up some "trivial cleanup" when prepping for
> > sending it out...
>
> Bah. Famous last words. One of the "trivial cleanups" made the code
> more "obvious" by renaming the nospec mask as just "mask".
>
> And that trivial rename broke that patch *entirely*, because now that
> name shadowed the "fmode_t" mask argument.
>
> Don't even ask how long it took me to go from "I *tested* this,
> dammit, now it doesn't work at all" to "Oh God, I'm so stupid".
>
> So that nobody else would waste any time on this, attached is a new
> attempt. This time actually tested *after* the changes.
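
For anyone following along, the shadowing problem described above can be sketched in userspace C. This is only an illustration under stated assumptions, not a reconstruction of the actual patch: struct entry, MODE_PATH, index_mask(), lookup_buggy() and lookup_fixed() are hypothetical names, and index_mask() is a plain comparison rather than the kernel's speculation-safe helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODE_PATH 0x1u /* hypothetical flag standing in for an fmode_t bit */

struct entry {
    unsigned int mode;
};

/* All ones if idx < size, else zero. Illustrative only: this is NOT the
 * kernel's array_index_mask_nospec() and gives no speculation guarantees. */
uintptr_t index_mask(size_t idx, size_t size)
{
    return (uintptr_t)0 - (uintptr_t)(idx < size);
}

/* Buggy variant: the inner "mask" shadows the "mask" parameter. */
struct entry *lookup_buggy(struct entry **table, size_t size, size_t idx,
                           unsigned int mask)
{
    assert(table != NULL && size > 0);
    {
        uintptr_t mask = index_mask(idx, size); /* shadows the parameter */
        struct entry *e = table[idx & mask];

        e = (struct entry *)((uintptr_t)e & mask);
        if (!e)
            return NULL;
        /* Meant to test the caller's mode mask, but this tests the index
         * mask (all ones here), so any nonzero mode is rejected. */
        if (e->mode & mask)
            return NULL;
        return e;
    }
}

/* Fixed variant: a distinct name keeps the parameter visible. */
struct entry *lookup_fixed(struct entry **table, size_t size, size_t idx,
                           unsigned int mask)
{
    assert(table != NULL && size > 0);
    {
        uintptr_t nospec_mask = index_mask(idx, size);
        struct entry *e = table[idx & nospec_mask];

        e = (struct entry *)((uintptr_t)e & nospec_mask);
        if (!e)
            return NULL;
        if (e->mode & mask) /* the caller's mode mask, as intended */
            return NULL;
        return e;
    }
}
```

With a table entry whose mode is 0x2, lookup_fixed(table, size, 0, MODE_PATH) returns the entry, while lookup_buggy() returns NULL for the same call. Building with gcc's -Wshadow flags the inner declaration.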

We applied the new patch on top of 0ede61d858 and confirmed that the
regression is gone. will-it-scale.per_thread_ops is now even 3.4% better
than with 93faf426e3.
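
As a quick sanity check, the percentages can be recomputed from the will-it-scale.per_thread_ops values in the table in this mail (399440 for 93faf426e3, 388036 for 0ede61d858, 412848 with the new patch). The helper below is a minimal sketch; pct_change() is a hypothetical name and not part of the lkp tooling.

```c
#include <assert.h>

/* Relative change of `value` against `base`, in percent. */
double pct_change(double base, double value)
{
    assert(base != 0.0);
    return (value - base) / base * 100.0;
}
```

pct_change(399440, 388036) comes out at roughly -2.9 and pct_change(399440, 412848) at roughly +3.4, matching the %change columns.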

Tested-by: kernel test robot <oliver.sang@xxxxxxxxx>

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit:
93faf426e3 ("vfs: shave work on failed file open")
0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU")
c712b4365b ("Improve __fget_files_rcu() code generation (and thus __fget_light())")

93faf426e3cc000c 0ede61d8589cc2d93aa78230d74 c712b4365b5b4dbe1d1380edd37
---------------- --------------------------- ---------------------------
         %stddev    %change          %stddev    %change          %stddev
            \          |                \          |                \
    228481 ±  4%      -4.6%     217900 ±  6%     -11.7%     201857 ±  5%  meminfo.DirectMap4k
     89056            -2.0%      87309            -1.6%      87606        proc-vmstat.nr_slab_unreclaimable
     16.28            -0.7%      16.16            -1.0%      16.12        turbostat.RAMWatt
      0.01 ±  9%  +58125.6%       4.17 ±175%  +23253.5%       1.67 ±222%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    781.67 ± 10%      +6.5%     832.50 ± 19%     -14.3%     670.17 ±  4%  perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
     97958 ±  7%      -9.7%      88449 ±  4%      -0.6%      97399 ±  4%  sched_debug.cpu.avg_idle.stddev
      0.00 ± 12%     +24.2%       0.00 ± 17%      -5.2%       0.00 ±  7%  sched_debug.cpu.next_balance.stddev
   6391048            -2.9%    6208584            +3.4%    6605584        will-it-scale.16.threads
    399440            -2.9%     388036            +3.4%     412848        will-it-scale.per_thread_ops
   6391048            -2.9%    6208584            +3.4%    6605584        will-it-scale.workload
     19.99 ±  4%       -2.2      17.74             +1.2      21.18 ±  2%  perf-profile.calltrace.cycles-pp.fput.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      1.27 ±  5%       +0.8       2.11 ±  3%      +31.1      32.36 ±  2%  perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
     32.69 ±  4%       +5.0      37.70            -32.7       0.00        perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      0.00            +27.9      27.85             +0.0       0.00        perf-profile.calltrace.cycles-pp.__get_file_rcu.__fget_light.do_poll.do_sys_poll.__x64_sys_poll
     20.00 ±  4%       -2.3      17.75             +0.4      20.43 ±  2%  perf-profile.children.cycles-pp.fput
      0.24 ± 10%       -0.1       0.18 ±  2%       -0.1       0.18 ± 10%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      1.48 ±  5%       +0.5       1.98 ±  3%      +30.8      32.32 ±  2%  perf-profile.children.cycles-pp.__fdget
     31.85 ±  4%       +6.0      37.86            -31.8       0.00        perf-profile.children.cycles-pp.__fget_light
      0.00            +27.7      27.67             +0.0       0.00        perf-profile.children.cycles-pp.__get_file_rcu
     30.90 ±  4%      -20.6      10.35 ±  2%      -30.9       0.00        perf-profile.self.cycles-pp.__fget_light
     19.94 ±  4%       -2.4      17.53             -0.3      19.62 ±  2%  perf-profile.self.cycles-pp.fput
      9.81 ±  4%       -2.4       7.42 ±  2%       +1.7      11.51 ±  4%  perf-profile.self.cycles-pp.do_poll
      0.23 ± 11%       -0.1       0.17 ±  4%       -0.1       0.18 ± 11%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.44 ±  7%       +0.0       0.45 ±  5%       +0.1       0.52 ±  4%  perf-profile.self.cycles-pp.__poll
      0.85 ±  4%       +0.1       0.92 ±  3%      +30.3      31.17 ±  2%  perf-profile.self.cycles-pp.__fdget
      0.00            +26.5      26.48             +0.0       0.00        perf-profile.self.cycles-pp.__get_file_rcu
 2.146e+10 ±  2%      +8.5%  2.329e+10 ±  2%      -2.1%  2.101e+10        perf-stat.i.branch-instructions
      0.22 ± 14%       -0.0       0.19 ± 14%       -0.0       0.20 ±  3%  perf-stat.i.branch-miss-rate%
 2.424e+10 ±  2%      +4.1%  2.524e+10 ±  2%      -4.7%  2.311e+10        perf-stat.i.dTLB-loads
 1.404e+10 ±  2%      +8.7%  1.526e+10 ±  2%      -6.2%  1.316e+10        perf-stat.i.dTLB-stores
     70.87             -2.3      68.59             -1.0      69.90        perf-stat.i.iTLB-load-miss-rate%
   5267608            -5.5%    4979133 ±  2%      -0.4%    5244253        perf-stat.i.iTLB-load-misses
   2102507            +5.4%    2215725            +5.7%    2222286        perf-stat.i.iTLB-loads
     18791 ±  3%     +10.5%      20757 ±  2%      -1.8%      18446        perf-stat.i.instructions-per-iTLB-miss
    266.67 ±  2%      +6.8%     284.75 ±  2%      -4.1%     255.70        perf-stat.i.metric.M/sec
      0.01 ± 10%     -10.5%       0.01 ±  5%      -1.8%       0.01 ±  6%  perf-stat.overall.MPKI
      0.19             -0.0       0.17             +0.0       0.20        perf-stat.overall.branch-miss-rate%
      0.65            -3.1%       0.63            +6.1%       0.69        perf-stat.overall.cpi
      0.00 ±  4%       -0.0       0.00 ±  4%       +0.0       0.00 ±  4%  perf-stat.overall.dTLB-store-miss-rate%
     71.48             -2.3      69.21             -1.2      70.24        perf-stat.overall.iTLB-load-miss-rate%
     18757           +10.0%      20629            -3.2%      18161        perf-stat.overall.instructions-per-iTLB-miss
      1.54            +3.2%       1.59            -5.8%       1.45        perf-stat.overall.ipc
   4795147            +6.4%    5100406            -9.0%    4365017        perf-stat.overall.path-length
  2.14e+10 ±  2%      +8.5%  2.322e+10 ±  2%      -2.1%  2.094e+10        perf-stat.ps.branch-instructions
 2.417e+10 ±  2%      +4.1%  2.516e+10 ±  2%      -4.7%  2.303e+10        perf-stat.ps.dTLB-loads
   1.4e+10 ±  2%      +8.7%  1.522e+10 ±  2%      -6.3%  1.312e+10        perf-stat.ps.dTLB-stores
   5253923            -5.5%    4966218 ±  2%      -0.5%    5228207        perf-stat.ps.iTLB-load-misses
   2095770            +5.4%    2208605            +5.7%    2214962        perf-stat.ps.iTLB-loads
 3.065e+13            +3.3%  3.167e+13            -5.9%  2.883e+13        perf-stat.total.instructions

>
> Linus