Re: [PATCH v2 0/4] Static calls

From: Steven Rostedt
Date: Mon Nov 26 2018 - 15:54:18 EST



Here's the test with the attached config (a Fedora distro config with
localmodconfig run against it), along with two patches that implement
tracepoints with static calls. The first makes a tracepoint call its
single callback directly through a function pointer when only one
callback is registered, or call an "iterator" that walks the list of
callbacks when more than one callback is associated with the tracepoint.
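
To show the idea, here's a minimal userspace sketch of that dispatch
scheme. The structure and function names are made up for the example and
are not what the patch itself uses:

#include <stdio.h>

/* One registered callback: the function plus its private data. */
struct tp_callback {
	void (*func)(void *data, unsigned long arg);
	void *data;
};

/* The tracepoint site always calls tp->dispatch(tp->dispatch_data, arg). */
struct my_tracepoint {
	void (*dispatch)(void *data, unsigned long arg);
	void *dispatch_data;
	struct tp_callback *funcs;	/* array of nr_funcs callbacks */
	int nr_funcs;
};

/* Slow path: walk every registered callback. */
static void tp_iterator(void *data, unsigned long arg)
{
	struct my_tracepoint *tp = data;
	int i;

	for (i = 0; i < tp->nr_funcs; i++)
		tp->funcs[i].func(tp->funcs[i].data, arg);
}

/* Re-pick the dispatch function whenever callbacks are added or removed. */
static void tp_update(struct my_tracepoint *tp)
{
	if (tp->nr_funcs == 1) {
		/* Single callback: call it directly, no iteration. */
		tp->dispatch = tp->funcs[0].func;
		tp->dispatch_data = tp->funcs[0].data;
	} else {
		/* More than one callback: go through the iterator. */
		tp->dispatch = tp_iterator;
		tp->dispatch_data = tp;
	}
}

static void print_cb(void *data, unsigned long arg)
{
	printf("%s: %lu\n", (const char *)data, arg);
}

int main(void)
{
	struct tp_callback cbs[] = {
		{ print_cb, (void *)"first" },
		{ print_cb, (void *)"second" },
	};
	struct my_tracepoint tp = { .funcs = cbs, .nr_funcs = 1 };

	tp_update(&tp);
	tp.dispatch(tp.dispatch_data, 1);	/* direct call to print_cb */

	tp.nr_funcs = 2;
	tp_update(&tp);
	tp.dispatch(tp.dispatch_data, 2);	/* goes through tp_iterator */

	return 0;
}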

It adds printk()s where it enables and disables the tracepoints, so
expect to see a lot of output when you enable the tracepoints. This is
to verify that the right code is being assigned.

Here's what I did.

1) I first took the config, turned off CONFIG_RETPOLINE, and built
v4.20-rc4 with that. I ran this to see what the effect was without
retpolines. I booted that kernel and did the following (which is also
what I did for every kernel):

# trace-cmd start -e all

To get the same effect, you could also do:

# echo 1 > /sys/kernel/debug/tracing/events/enable

# perf stat -r 10 /work/c/hackbench 50

The output was this:

No RETPOLINES:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.351
Time: 1.414
Time: 1.319
Time: 1.277
Time: 1.280
Time: 1.305
Time: 1.294
Time: 1.342
Time: 1.319
Time: 1.288

Performance counter stats for '/work/c/hackbench 50' (10 runs):

10,727.44 msec task-clock # 7.397 CPUs utilized ( +- 0.95% )
126,300 context-switches # 11774.138 M/sec ( +- 13.80% )
14,309 cpu-migrations # 1333.973 M/sec ( +- 8.73% )
44,073 page-faults # 4108.652 M/sec ( +- 0.68% )
39,484,799,554 cycles # 3680914.295 GHz ( +- 0.95% )
28,470,896,143 stalled-cycles-frontend # 72.11% frontend cycles idle ( +- 0.95% )
26,521,427,813 instructions # 0.67 insn per cycle
# 1.07 stalled cycles per insn ( +- 0.85% )
4,931,066,096 branches # 459691625.400 M/sec ( +- 0.87% )
19,063,801 branch-misses # 0.39% of all branches ( +- 2.05% )

1.4503 +- 0.0148 seconds time elapsed ( +- 1.02% )

Then I enabled CONFIG_RETPOLINE, built and booted that kernel, and ran
the test again:

baseline RETPOLINES:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.313
Time: 1.386
Time: 1.335
Time: 1.363
Time: 1.357
Time: 1.369
Time: 1.363
Time: 1.489
Time: 1.357
Time: 1.422

Performance counter stats for '/work/c/hackbench 50' (10 runs):

11,162.24 msec task-clock # 7.383 CPUs utilized ( +- 1.11% )
112,882 context-switches # 10113.153 M/sec ( +- 15.86% )
14,255 cpu-migrations # 1277.103 M/sec ( +- 7.78% )
43,067 page-faults # 3858.393 M/sec ( +- 1.04% )
41,076,270,559 cycles # 3680042.874 GHz ( +- 1.12% )
29,669,137,584 stalled-cycles-frontend # 72.23% frontend cycles idle ( +- 1.21% )
26,647,656,812 instructions # 0.65 insn per cycle
# 1.11 stalled cycles per insn ( +- 0.81% )
5,069,504,923 branches # 454179389.091 M/sec ( +- 0.83% )
99,135,413 branch-misses # 1.96% of all branches ( +- 0.87% )

1.5120 +- 0.0133 seconds time elapsed ( +- 0.88% )


Then I applied the first attached tracepoint patch, which makes the
change to call the callback directly (and makes it possible to use
static calls later), and tested that.

Added direct calls for trace_events:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.448
Time: 1.386
Time: 1.404
Time: 1.386
Time: 1.344
Time: 1.397
Time: 1.378
Time: 1.351
Time: 1.369
Time: 1.385

Performance counter stats for '/work/c/hackbench 50' (10 runs):

11,249.28 msec task-clock # 7.382 CPUs utilized ( +- 0.64% )
112,058 context-switches # 9961.721 M/sec ( +- 11.15% )
15,535 cpu-migrations # 1381.033 M/sec ( +- 10.34% )
43,673 page-faults # 3882.433 M/sec ( +- 1.14% )
41,407,431,000 cycles # 3681020.455 GHz ( +- 0.63% )
29,842,394,154 stalled-cycles-frontend # 72.07% frontend cycles idle ( +- 0.63% )
26,669,867,181 instructions # 0.64 insn per cycle
# 1.12 stalled cycles per insn ( +- 0.58% )
5,085,122,641 branches # 452055102.392 M/sec ( +- 0.60% )
108,935,006 branch-misses # 2.14% of all branches ( +- 0.57% )

1.5239 +- 0.0139 seconds time elapsed ( +- 0.91% )


Then I added patches 1 and 2, applied the second attached patch, and
ran that:

With static calls:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.407
Time: 1.424
Time: 1.352
Time: 1.355
Time: 1.361
Time: 1.416
Time: 1.453
Time: 1.353
Time: 1.341
Time: 1.439

Performance counter stats for '/work/c/hackbench 50' (10 runs):

11,293.08 msec task-clock # 7.390 CPUs utilized ( +- 0.93% )
125,343 context-switches # 11099.462 M/sec ( +- 11.84% )
15,587 cpu-migrations # 1380.272 M/sec ( +- 8.21% )
43,871 page-faults # 3884.890 M/sec ( +- 1.06% )
41,567,508,330 cycles # 3680918.499 GHz ( +- 0.94% )
29,851,271,023 stalled-cycles-frontend # 71.81% frontend cycles idle ( +- 0.99% )
26,878,085,513 instructions # 0.65 insn per cycle
# 1.11 stalled cycles per insn ( +- 0.72% )
5,125,816,911 branches # 453905346.879 M/sec ( +- 0.74% )
107,643,635 branch-misses # 2.10% of all branches ( +- 0.71% )

1.5282 +- 0.0135 seconds time elapsed ( +- 0.88% )
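
For anyone not following the series closely, a static call gets used
roughly like this. This is only a sketch of the shape of the API (a
kernel-style fragment, not buildable on its own; the function names are
made up, and the exact macro spellings are whatever the series'
static_call.h defines):

#include <linux/static_call.h>

static void func_a(int arg) { /* ... */ }
static void func_b(int arg) { /* ... */ }

/* "my_key" starts out routed to func_a. */
DEFINE_STATIC_CALL(my_key, func_a);

static void call_site(int arg)
{
	/*
	 * Instead of an indirect call through a function pointer (which
	 * CONFIG_RETPOLINE turns into a retpoline), this can compile to a
	 * direct call that gets repointed at runtime.
	 */
	static_call(my_key)(arg);
}

static void switch_callback(void)
{
	/* Patch every static_call(my_key) user over to func_b. */
	static_call_update(my_key, &func_b);
}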

Then I applied patch 3 and tested that:

With static call trampolines:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.350
Time: 1.333
Time: 1.369
Time: 1.361
Time: 1.375
Time: 1.352
Time: 1.316
Time: 1.336
Time: 1.339
Time: 1.371

Performance counter stats for '/work/c/hackbench 50' (10 runs):

10,964.38 msec task-clock # 7.392 CPUs utilized ( +- 0.41% )
75,986 context-switches # 6930.527 M/sec ( +- 9.23% )
12,464 cpu-migrations # 1136.858 M/sec ( +- 7.93% )
44,476 page-faults # 4056.558 M/sec ( +- 1.12% )
40,354,963,428 cycles # 3680712.468 GHz ( +- 0.42% )
29,057,240,222 stalled-cycles-frontend # 72.00% frontend cycles idle ( +- 0.46% )
26,171,883,339 instructions # 0.65 insn per cycle
# 1.11 stalled cycles per insn ( +- 0.32% )
4,978,193,830 branches # 454053195.523 M/sec ( +- 0.33% )
83,625,127 branch-misses # 1.68% of all branches ( +- 0.33% )

1.48328 +- 0.00515 seconds time elapsed ( +- 0.35% )

And finally I added patch 4 and tested that:

Full static calls:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.302
Time: 1.323
Time: 1.356
Time: 1.325
Time: 1.372
Time: 1.373
Time: 1.319
Time: 1.313
Time: 1.362
Time: 1.322

Performance counter stats for '/work/c/hackbench 50' (10 runs):

10,865.10 msec task-clock # 7.373 CPUs utilized ( +- 0.62% )
88,718 context-switches # 8165.823 M/sec ( +- 10.11% )
13,463 cpu-migrations # 1239.125 M/sec ( +- 8.42% )
44,574 page-faults # 4102.673 M/sec ( +- 0.60% )
39,991,476,585 cycles # 3680897.280 GHz ( +- 0.63% )
28,713,229,777 stalled-cycles-frontend # 71.80% frontend cycles idle ( +- 0.68% )
26,289,703,633 instructions # 0.66 insn per cycle
# 1.09 stalled cycles per insn ( +- 0.44% )
4,983,099,105 branches # 458654631.123 M/sec ( +- 0.45% )
83,719,799 branch-misses # 1.68% of all branches ( +- 0.44% )

1.47364 +- 0.00706 seconds time elapsed ( +- 0.48% )
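
To recap what the last three kernels are actually doing at a call site,
here's my rough mental model of the series, written as pseudo assembly in
a comment (the labels are made up, and the real generated code will
differ):

/*
 * What a static_call(my_key)(arg) site roughly turns into:
 *
 * "With static calls" (patches 1+2, generic implementation only):
 *
 *	call *my_key.func(%rip)		; still an indirect call, so still
 *					; a retpoline on this kernel
 *
 * "With static call trampolines" (patch 3, out-of-line implementation):
 *
 *	call my_key_tramp		; direct call to a small trampoline
 *   my_key_tramp:
 *	jmp current_target		; this jmp is rewritten on
 *					; static_call_update()
 *
 * "Full static calls" (patch 4, inline implementation):
 *
 *	call current_target		; the call site itself is rewritten
 *					; on static_call_update()
 */

That also lines up with the branch-miss numbers above: around 2% for the
retpolined runs, dropping to 1.68% once the calls become direct.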


In summary, we had this:

No RETPOLINES:
1.4503 +- 0.0148 seconds time elapsed ( +- 1.02% )

baseline RETPOLINES:
1.5120 +- 0.0133 seconds time elapsed ( +- 0.88% )

Added direct calls for trace_events:
1.5239 +- 0.0139 seconds time elapsed ( +- 0.91% )

With static calls:
1.5282 +- 0.0135 seconds time elapsed ( +- 0.88% )

With static call trampolines:
1.48328 +- 0.00515 seconds time elapsed ( +- 0.35% )

Full static calls:
1.47364 +- 0.00706 seconds time elapsed ( +- 0.48% )


Adding retpolines caused a 1.5120 / 1.4503 = 1.0425 (4.25%) slowdown.

Trampolines brought that down to 1.48328 / 1.4503 = 1.0227 (2.27%).

Full static calls brought it down to 1.47364 / 1.4503 = 1.0160 (1.60%).

Going from 4.25% to 1.6% isn't bad, and I think this is very much worth
the effort. I did not expect it to go to 0%, as there are a lot of other
places where retpolines cause issues, but this shows that it does help
the tracing code.

I originally did the tests with my development config, which has a
bunch of debugging options enabled (hackbench usually takes over 9
seconds, not the 1.5 seen here), and with that config the slowdown from
retpolines was closer to 9%. If people want, I can redo the tests with
that config, or I can send it to them. Or better yet, the code is here;
just use your own configs.

-- Steve

Attachment: config-distro
Description: Binary data