Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case

From: Daniel Borkmann
Date: Thu Feb 28 2019 - 11:18:35 EST

Next message: Steven Rostedt: "Re: [PATCH] tracing: Fix event filters and triggers to handle negative numbers"
Previous message: Sean Christopherson: "Re: [PATCH v3 6/8] KVM:VMX: Load Guest CET via VMCS when CET is enabled in Guest"
In reply to: H.J. Lu: "Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case"
Next in thread: H.J. Lu: "Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 02/28/2019 01:53 PM, H.J. Lu wrote:
> On Thu, Feb 28, 2019 at 3:27 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>> On Thu, 2019-02-28 at 03:12 -0800, tip-bot for Daniel Borkmann wrote:
>>> Commit-ID: ce02ef06fcf7a399a6276adb83f37373d10cbbe1
>>> Gitweb: https://git.kernel.org/tip/ce02ef06fcf7a399a6276adb83f37373d10cbbe1
>>> Author: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
>>> AuthorDate: Thu, 21 Feb 2019 23:19:41 +0100
>>> Committer: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>> CommitDate: Thu, 28 Feb 2019 12:10:31 +0100
>>>
>>> x86, retpolines: Raise limit for generating indirect calls from switch-case
>>>
>>> From networking side, there are numerous attempts to get rid of indirect
>>> calls in fast-path wherever feasible in order to avoid the cost of
>>> retpolines, for example, just to name a few:
>>>
>>> * 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
>>> * aaa5d90b395a ("net: use indirect call wrappers at GRO network layer")
>>> * 028e0a476684 ("net: use indirect call wrappers at GRO transport layer")
>>> * 356da6d0cde3 ("dma-mapping: bypass indirect calls for dma-direct")
>>> * 09772d92cd5a ("bpf: avoid retpoline for lookup/update/delete calls on maps")
>>> * 10870dd89e95 ("netfilter: nf_tables: add direct calls for all builtin expressions")
>>> [...]
>>>
>>> Recent work on XDP from BjÃrn and Magnus additionally found that manually
>>> transforming the XDP return code switch statement with more than 5 cases
>>> into if-else combination would result in a considerable speedup in XDP
>>> layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled
>>> builds.
>>
>> +HJL
>>
>> This is a GCC bug, surely? It should know how expensive each
>> instruction is, and choose which to use accordingly. That should be
>> true even when the indirect branch "instruction" is a retpoline, and
>> thus enormously expensive.
>>
>> I believe this is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952 so
>> please at least reference that bug, and be prepared to turn this hack
>> off when GCC is fixed.
>
> We couldn't find a testcase to show jump table with indirect branch
> is slower than direct branches.

Ok, I've just checked https://github.com/marxin/microbenchmark/tree/retpoline-table
with the below on top.

Makefile | 6 +++---
switch.c | 2 +-
test.c | 6 ++++--
3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/Makefile b/Makefile
index bd83233..ea81520 100644
--- a/Makefile
+++ b/Makefile
@@ -1,16 +1,16 @@
CC=gcc
CFLAGS=-g -I.
-CFLAGS+=-O2 -mindirect-branch=thunk
+CFLAGS+=-O2 -mindirect-branch=thunk-inline -mindirect-branch-register
ASFLAGS=-g

EXE=test

OBJS=test.o switch-no-table.o switch.o

-switch-no-table.o switch-no-table.s: CFLAGS += -fno-jump-tables
+switch-no-table.o switch-no-table.s: CFLAGS += --param=case-values-threshold=20

all: $(EXE)
- ./$(EXE)
+ taskset 1 ./$(EXE)

$(EXE): $(OBJS)
$(CC) -o $@ $^
diff --git a/switch.c b/switch.c
index fe0a8b0..233ec14 100644
--- a/switch.c
+++ b/switch.c
@@ -3,7 +3,7 @@ int global;
int
foo (int x)
{
- switch (x) {
+ switch (x & 0xf) {
case 0:
return 11;
case 1:
diff --git a/test.c b/test.c
index 3d1e0da..7fc22a4 100644
--- a/test.c
+++ b/test.c
@@ -15,21 +15,23 @@ main ()
unsigned long long start, end;
unsigned long long diff1, diff2;

+ global = 0;
start = __rdtscp (&i);
for (i = 0; i < LOOP; i++)
foo_no_table (i);
end = __rdtscp (&i);
diff1 = end - start;

- printf ("no jump table: %lld\n", diff1);
+ printf ("global:%d no jump table: %lld\n", global, diff1);

+ global = 0;
start = __rdtscp (&i);
for (i = 0; i < LOOP; i++)
foo (i);
end = __rdtscp (&i);
diff2 = end - start;

- printf ("jump table : %lld (%.2f%%)\n", diff2, 100.0f * diff2 / diff1);
+ printf ("global:%d jump table : %lld (%.2f%%)\n", global, diff2, 100.0f * diff2 / diff1);

return 0;
}
--
2.17.1

** This basically iterates through the cases:

Below I'm getting ~twice the time needed for jump table vs no jump table
for the flags kernel is using:

# make
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
taskset 1 ./test
global:50000000 no jump table: 6329361694
global:50000000 jump table : 13745181180 (217.17%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6328846466
global:50000000 jump table : 13746479870 (217.20%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6326922428
global:50000000 jump table : 13745139496 (217.25%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6327943506
global:50000000 jump table : 13744388354 (217.20%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6332503572
global:50000000 jump table : 13729817800 (216.82%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6328378006
global:50000000 jump table : 13747069902 (217.23%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6326481236
global:50000000 jump table : 13749345724 (217.33%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6329332628
global:50000000 jump table : 13745879704 (217.18%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 6327734850
global:50000000 jump table : 13746412678 (217.24%)

For comparison that both are 100% when raising limit is _not_ in use
(which is expected of course but just to make sure):

root@snat:~/microbenchmark# make
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
taskset 1 ./test
global:50000000 no jump table: 13704083238
global:50000000 jump table : 13746838060 (100.31%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13753854740
global:50000000 jump table : 13746624470 (99.95%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13707053714
global:50000000 jump table : 13746682002 (100.29%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13708843624
global:50000000 jump table : 13749733040 (100.30%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13707365404
global:50000000 jump table : 13747683096 (100.29%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13707014114
global:50000000 jump table : 13746444272 (100.29%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13709596158
global:50000000 jump table : 13750499176 (100.30%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13709484118
global:50000000 jump table : 13747952446 (100.28%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:50000000 no jump table: 13708873570
global:50000000 jump table : 13748950096 (100.29%)

** Next case would be constantly hitting first switch case:

diff --git a/switch.c b/switch.c
index 233ec14..fe0a8b0 100644
--- a/switch.c
+++ b/switch.c
@@ -3,7 +3,7 @@ int global;
int
foo (int x)
{
- switch (x & 0xf) {
+ switch (x) {
case 0:
return 11;
case 1:
diff --git a/test.c b/test.c
index 7fc22a4..2849112 100644
--- a/test.c
+++ b/test.c
@@ -5,6 +5,7 @@ extern int foo (int);
extern int foo_no_table (int);

int global = 20;
+int j = 0;

#define LOOP 800000000

@@ -18,7 +19,7 @@ main ()
global = 0;
start = __rdtscp (&i);
for (i = 0; i < LOOP; i++)
- foo_no_table (i);
+ foo_no_table (j);
end = __rdtscp (&i);
diff1 = end - start;

@@ -27,7 +28,7 @@ main ()
global = 0;
start = __rdtscp (&i);
for (i = 0; i < LOOP; i++)
- foo (i);
+ foo (j);
end = __rdtscp (&i);
diff2 = end - start;

# make
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
taskset 1 ./test
global:0 no jump table: 6098109200
global:0 jump table : 30717871980 (503.73%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6097799330
global:0 jump table : 30727386270 (503.91%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6097559796
global:0 jump table : 30715992452 (503.74%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6098532172
global:0 jump table : 30716423870 (503.67%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6097429586
global:0 jump table : 30715774634 (503.75%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6097813848
global:0 jump table : 30716476820 (503.73%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6096955736
global:0 jump table : 30715385478 (503.78%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6096820240
global:0 jump table : 30719682434 (503.86%)

** And next case would be constantly hitting default case:

diff --git a/test.c b/test.c
index 2849112..be9bfc1 100644
--- a/test.c
+++ b/test.c
@@ -5,7 +5,7 @@ extern int foo (int);
extern int foo_no_table (int);

int global = 20;
-int j = 0;
+int j = 1000;

#define LOOP 800000000

# make
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register --param=case-values-threshold=20 -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk-inline -mindirect-branch-register -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
taskset 1 ./test
global:0 no jump table: 6422890064
global:0 jump table : 6866072454 (106.90%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6423267608
global:0 jump table : 6866266176 (106.90%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6424721624
global:0 jump table : 6866607842 (106.88%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6424225664
global:0 jump table : 6866843372 (106.89%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6424073830
global:0 jump table : 6866467050 (106.89%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6426515396
global:0 jump table : 6867031640 (106.85%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6425126656
global:0 jump table : 6866352988 (106.87%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6423040024
global:0 jump table : 6867233670 (106.92%)
root@snat:~/microbenchmark# make
taskset 1 ./test
global:0 no jump table: 6422256136
global:0 jump table : 6865902094 (106.91%)

I could also try different distributions perhaps for the case selector,
but observations match in that direction with what Bjorn et al also have
been seen in XDP case.

Thanks,
Daniel

Next message: Steven Rostedt: "Re: [PATCH] tracing: Fix event filters and triggers to handle negative numbers"
Previous message: Sean Christopherson: "Re: [PATCH v3 6/8] KVM:VMX: Load Guest CET via VMCS when CET is enabled in Guest"
In reply to: H.J. Lu: "Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case"
Next in thread: H.J. Lu: "Re: [tip:x86/build] x86, retpolines: Raise limit for generating indirect calls from switch-case"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]