Re: [PATCH v2 0/4] Static calls

From: Steven Rostedt
Date: Mon Nov 26 2018 - 15:54:18 EST



Here's the test with the attached config (a Fedora distro config with
localmodconfig run against it), along with two patches that implement
tracepoints with static calls. The first makes a tracepoint call its
single callback directly through a function pointer when only one
callback is registered, or call an "iterator" that walks the list of
callbacks when more than one callback is associated with the tracepoint.
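
To show the idea, here's a minimal userspace sketch of that dispatch
scheme. The structure and function names are made up for the example and
are not what the patch itself uses:

#include <stdio.h>

/* One registered callback: the function plus its private data. */
struct tp_callback {
	void (*func)(void *data, unsigned long arg);
	void *data;
};

/* The tracepoint site always calls tp->dispatch(tp->dispatch_data, arg). */
struct my_tracepoint {
	void (*dispatch)(void *data, unsigned long arg);
	void *dispatch_data;
	struct tp_callback *funcs;	/* array of nr_funcs callbacks */
	int nr_funcs;
};

/* Slow path: walk every registered callback. */
static void tp_iterator(void *data, unsigned long arg)
{
	struct my_tracepoint *tp = data;
	int i;

	for (i = 0; i < tp->nr_funcs; i++)
		tp->funcs[i].func(tp->funcs[i].data, arg);
}

/* Re-pick the dispatch function whenever callbacks are added or removed. */
static void tp_update(struct my_tracepoint *tp)
{
	if (tp->nr_funcs == 1) {
		/* Single callback: call it directly, no iteration. */
		tp->dispatch = tp->funcs[0].func;
		tp->dispatch_data = tp->funcs[0].data;
	} else {
		/* More than one callback: go through the iterator. */
		tp->dispatch = tp_iterator;
		tp->dispatch_data = tp;
	}
}

static void print_cb(void *data, unsigned long arg)
{
	printf("%s: %lu\n", (const char *)data, arg);
}

int main(void)
{
	struct tp_callback cbs[] = {
		{ print_cb, (void *)"first" },
		{ print_cb, (void *)"second" },
	};
	struct my_tracepoint tp = { .funcs = cbs, .nr_funcs = 1 };

	tp_update(&tp);
	tp.dispatch(tp.dispatch_data, 1);	/* direct call to print_cb */

	tp.nr_funcs = 2;
	tp_update(&tp);
	tp.dispatch(tp.dispatch_data, 2);	/* goes through tp_iterator */

	return 0;
}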

It adds printk()s where it enables and disables the tracepoints, so
expect to see a lot of output when you enable the tracepoints. This is
to verify that the right code is being assigned.

Here's what I did.

1) I first took the config, turned off CONFIG_RETPOLINE, and built
v4.20-rc4 with that. I ran this to see what the effect was without
retpolines. I booted that kernel and did the following (which is also
what I did for every kernel):

# trace-cmd start -e all

To get the same effect, you could also do:

# echo 1 > /sys/kernel/debug/tracing/events/enable

# perf stat -r 10 /work/c/hackbench 50

The output was this:

No RETPOLINES:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.351
Time: 1.414
Time: 1.319
Time: 1.277
Time: 1.280
Time: 1.305
Time: 1.294
Time: 1.342
Time: 1.319
Time: 1.288

Performance counter stats for '/work/c/hackbench 50' (10 runs):

10,727.44 msec task-clock # 7.397 CPUs utilized ( +- 0.95% )
126,300 context-switches # 11774.138 M/sec ( +- 13.80% )
14,309 cpu-migrations # 1333.973 M/sec ( +- 8.73% )
44,073 page-faults # 4108.652 M/sec ( +- 0.68% )
39,484,799,554 cycles # 3680914.295 GHz ( +- 0.95% )
28,470,896,143 stalled-cycles-frontend # 72.11% frontend cycles idle ( +- 0.95% )
26,521,427,813 instructions # 0.67 insn per cycle
# 1.07 stalled cycles per insn ( +- 0.85% )
4,931,066,096 branches # 459691625.400 M/sec ( +- 0.87% )
19,063,801 branch-misses # 0.39% of all branches ( +- 2.05% )

1.4503 +- 0.0148 seconds time elapsed ( +- 1.02% )

Then I enabled CONFIG_RETPOLINE, built and booted that kernel, and ran
the test again:

baseline RETPOLINES:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.313
Time: 1.386
Time: 1.335
Time: 1.363
Time: 1.357
Time: 1.369
Time: 1.363
Time: 1.489
Time: 1.357
Time: 1.422

Performance counter stats for '/work/c/hackbench 50' (10 runs):

11,162.24 msec task-clock # 7.383 CPUs utilized ( +- 1.11% )
112,882 context-switches # 10113.153 M/sec ( +- 15.86% )
14,255 cpu-migrations # 1277.103 M/sec ( +- 7.78% )
43,067 page-faults # 3858.393 M/sec ( +- 1.04% )
41,076,270,559 cycles # 3680042.874 GHz ( +- 1.12% )
29,669,137,584 stalled-cycles-frontend # 72.23% frontend cycles idle ( +- 1.21% )
26,647,656,812 instructions # 0.65 insn per cycle
# 1.11 stalled cycles per insn ( +- 0.81% )
5,069,504,923 branches # 454179389.091 M/sec ( +- 0.83% )
99,135,413 branch-misses # 1.96% of all branches ( +- 0.87% )

1.5120 +- 0.0133 seconds time elapsed ( +- 0.88% )


Then I applied the first attached tracepoint patch, which makes the
change to call the callback directly (and makes it possible to use
static calls later), and tested that.

Added direct calls for trace_events:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.448
Time: 1.386
Time: 1.404
Time: 1.386
Time: 1.344
Time: 1.397
Time: 1.378
Time: 1.351
Time: 1.369
Time: 1.385

Performance counter stats for '/work/c/hackbench 50' (10 runs):

11,249.28 msec task-clock # 7.382 CPUs utilized ( +- 0.64% )
112,058 context-switches # 9961.721 M/sec ( +- 11.15% )
15,535 cpu-migrations # 1381.033 M/sec ( +- 10.34% )
43,673 page-faults # 3882.433 M/sec ( +- 1.14% )
41,407,431,000 cycles # 3681020.455 GHz ( +- 0.63% )
29,842,394,154 stalled-cycles-frontend # 72.07% frontend cycles idle ( +- 0.63% )
26,669,867,181 instructions # 0.64 insn per cycle
# 1.12 stalled cycles per insn ( +- 0.58% )
5,085,122,641 branches # 452055102.392 M/sec ( +- 0.60% )
108,935,006 branch-misses # 2.14% of all branches ( +- 0.57% )

1.5239 +- 0.0139 seconds time elapsed ( +- 0.91% )


Then I added patches 1 and 2, applied the second attached patch, and
ran that:

With static calls:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.407
Time: 1.424
Time: 1.352
Time: 1.355
Time: 1.361
Time: 1.416
Time: 1.453
Time: 1.353
Time: 1.341
Time: 1.439

Performance counter stats for '/work/c/hackbench 50' (10 runs):

11,293.08 msec task-clock # 7.390 CPUs utilized ( +- 0.93% )
125,343 context-switches # 11099.462 M/sec ( +- 11.84% )
15,587 cpu-migrations # 1380.272 M/sec ( +- 8.21% )
43,871 page-faults # 3884.890 M/sec ( +- 1.06% )
41,567,508,330 cycles # 3680918.499 GHz ( +- 0.94% )
29,851,271,023 stalled-cycles-frontend # 71.81% frontend cycles idle ( +- 0.99% )
26,878,085,513 instructions # 0.65 insn per cycle
# 1.11 stalled cycles per insn ( +- 0.72% )
5,125,816,911 branches # 453905346.879 M/sec ( +- 0.74% )
107,643,635 branch-misses # 2.10% of all branches ( +- 0.71% )

1.5282 +- 0.0135 seconds time elapsed ( +- 0.88% )
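
For anyone not following the series closely, a static call gets used
roughly like this. This is only a sketch of the shape of the API (a
kernel-style fragment, not buildable on its own; the function names are
made up, and the exact macro spellings are whatever the series'
static_call.h defines):

#include <linux/static_call.h>

static void func_a(int arg) { /* ... */ }
static void func_b(int arg) { /* ... */ }

/* "my_key" starts out routed to func_a. */
DEFINE_STATIC_CALL(my_key, func_a);

static void call_site(int arg)
{
	/*
	 * Instead of an indirect call through a function pointer (which
	 * CONFIG_RETPOLINE turns into a retpoline), this can compile to a
	 * direct call that gets repointed at runtime.
	 */
	static_call(my_key)(arg);
}

static void switch_callback(void)
{
	/* Patch every static_call(my_key) user over to func_b. */
	static_call_update(my_key, &func_b);
}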

Then I applied patch 3 and tested that:

With static call trampolines:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.350
Time: 1.333
Time: 1.369
Time: 1.361
Time: 1.375
Time: 1.352
Time: 1.316
Time: 1.336
Time: 1.339
Time: 1.371

Performance counter stats for '/work/c/hackbench 50' (10 runs):

10,964.38 msec task-clock # 7.392 CPUs utilized ( +- 0.41% )
75,986 context-switches # 6930.527 M/sec ( +- 9.23% )
12,464 cpu-migrations # 1136.858 M/sec ( +- 7.93% )
44,476 page-faults # 4056.558 M/sec ( +- 1.12% )
40,354,963,428 cycles # 3680712.468 GHz ( +- 0.42% )
29,057,240,222 stalled-cycles-frontend # 72.00% frontend cycles idle ( +- 0.46% )
26,171,883,339 instructions # 0.65 insn per cycle
# 1.11 stalled cycles per insn ( +- 0.32% )
4,978,193,830 branches # 454053195.523 M/sec ( +- 0.33% )
83,625,127 branch-misses # 1.68% of all branches ( +- 0.33% )

1.48328 +- 0.00515 seconds time elapsed ( +- 0.35% )

And finally I added patch 4 and tested that:

Full static calls:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.302
Time: 1.323
Time: 1.356
Time: 1.325
Time: 1.372
Time: 1.373
Time: 1.319
Time: 1.313
Time: 1.362
Time: 1.322

Performance counter stats for '/work/c/hackbench 50' (10 runs):

10,865.10 msec task-clock # 7.373 CPUs utilized ( +- 0.62% )
88,718 context-switches # 8165.823 M/sec ( +- 10.11% )
13,463 cpu-migrations # 1239.125 M/sec ( +- 8.42% )
44,574 page-faults # 4102.673 M/sec ( +- 0.60% )
39,991,476,585 cycles # 3680897.280 GHz ( +- 0.63% )
28,713,229,777 stalled-cycles-frontend # 71.80% frontend cycles idle ( +- 0.68% )
26,289,703,633 instructions # 0.66 insn per cycle
# 1.09 stalled cycles per insn ( +- 0.44% )
4,983,099,105 branches # 458654631.123 M/sec ( +- 0.45% )
83,719,799 branch-misses # 1.68% of all branches ( +- 0.44% )

1.47364 +- 0.00706 seconds time elapsed ( +- 0.48% )
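
To recap what the last three kernels are actually doing at a call site,
here's my rough mental model of the series, written as pseudo assembly in
a comment (the labels are made up, and the real generated code will
differ):

/*
 * What a static_call(my_key)(arg) site roughly turns into:
 *
 * "With static calls" (patches 1+2, generic implementation only):
 *
 *	call *my_key.func(%rip)		; still an indirect call, so still
 *					; a retpoline on this kernel
 *
 * "With static call trampolines" (patch 3, out-of-line implementation):
 *
 *	call my_key_tramp		; direct call to a small trampoline
 *   my_key_tramp:
 *	jmp current_target		; this jmp is rewritten on
 *					; static_call_update()
 *
 * "Full static calls" (patch 4, inline implementation):
 *
 *	call current_target		; the call site itself is rewritten
 *					; on static_call_update()
 */

That also lines up with the branch-miss numbers above: around 2% for the
retpolined runs, dropping to 1.68% once the calls become direct.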


In summary, we had this:

No RETPOLINES:
1.4503 +- 0.0148 seconds time elapsed ( +- 1.02% )

baseline RETPOLINES:
1.5120 +- 0.0133 seconds time elapsed ( +- 0.88% )

Added direct calls for trace_events:
1.5239 +- 0.0139 seconds time elapsed ( +- 0.91% )

With static calls:
1.5282 +- 0.0135 seconds time elapsed ( +- 0.88% )

With static call trampolines:
1.48328 +- 0.00515 seconds time elapsed ( +- 0.35% )

Full static calls:
1.47364 +- 0.00706 seconds time elapsed ( +- 0.48% )


Adding retpolines caused a 1.5120 / 1.4503 = 1.0425 (4.25%) slowdown.

Trampolines brought that down to 1.48328 / 1.4503 = 1.0227 (2.27%).

Full static calls brought it down to 1.47364 / 1.4503 = 1.0160 (1.60%).

Going from 4.25% to 1.6% isn't bad, and I think this is very much worth
the effort. I did not expect it to go to 0%, as there are a lot of other
places where retpolines cause issues, but this shows that it does help
the tracing code.

I originally did the tests with my development config, which has a
bunch of debugging options enabled (hackbench usually takes over 9
seconds, not the 1.5 seen here), and with that config the slowdown from
retpolines was closer to 9%. If people want, I can redo the tests with
that config, or I can send it to them. Or better yet, the code is here;
just use your own configs.

-- Steve

Attachment: config-distro
Description: Binary data