Re: [tip:x86/urgent] bpf: Fix ORC unwinding in non-JIT BPF code

From: Alexei Starovoitov
Date: Mon Jul 08 2019 - 18:50:01 EST


On Mon, Jul 8, 2019 at 3:38 PM Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>
> On Mon, Jul 08, 2019 at 03:15:37PM -0700, Alexei Starovoitov wrote:
> > > 2)
> > >
> > > After doing the first optimization, GCC then does another one which is
> > > a little trickier. It replaces:
> > >
> > > select_insn:
> > > jmp *jumptable(, %rax, 8)
> > > ...
> > > ALU64_ADD_X:
> > > ...
> > > jmp *jumptable(, %rax, 8)
> > > ALU_ADD_X:
> > > ...
> > > jmp *jumptable(, %rax, 8)
> > >
> > > with
> > >
> > > select_insn:
> > > mov jumptable, %r12
> > > jmp *(%r12, %rax, 8)
> > > ...
> > > ALU64_ADD_X:
> > > ...
> > > jmp *(%r12, %rax, 8)
> > > ALU_ADD_X:
> > > ...
> > > jmp *(%r12, %rax, 8)
> > >
> > > The problem is that it only moves the jumptable address into %r12
> > > once, for the entire function, then it goes through multiple recursive
> > > indirect jumps which rely on that %r12 value. But objtool isn't yet
> > > smart enough to be able to track the value across multiple recursive
> > > indirect jumps through the jump table.
> > >
> > > After some digging I found that the quick and easy fix is to disable
> > > -fgcse. In fact, this seems to be recommended by the GCC manual, for
> > > code like this:
> > >
> > > -fgcse
> > > Perform a global common subexpression elimination pass. This
> > > pass also performs global constant and copy propagation.
> > >
> > > Note: When compiling a program using computed gotos, a GCC
> > > extension, you may get better run-time performance if you
> > > disable the global common subexpression elimination pass by
> > > adding -fno-gcse to the command line.
> > >
> > > Enabled at levels -O2, -O3, -Os.
> > >
> > > This code indeed relies extensively on computed gotos. I don't know
> > > *why* disabling this optimization would improve performance. In fact
> > > I really don't see how it could make much of a difference either way.
> > >
> > > Anyway, using -fno-gcse makes optimization #2 go away and makes
> > > objtool happy, with only a fix for #1 needed.
> > >
> > > If -fno-gcse isn't an option, we might be able to fix objtool by using
> > > the "first_jump_src" thing which Peter added, improving it such that
> > > it also takes table jumps into account.
> >
> > Sorry for delay. I'm mostly offgrid until next week.
> > As far as -fno-gcse.. I don't mind as long as it doesn't hurt performance.
> > Which I suspect it will :(
> > All these indirect gotos are there for performance.
> > Single indirect goto and a bunch of jmp select_insn
> > are way slower, since there is only one instruction
> > for cpu branch predictor to work with.
> > When every insn is followed by "jmp *jumptable"
> > there is more room for cpu to speculate.
> > It's been long time, but when I wrote it the difference
> > between all indirect goto vs single indirect goto was almost 2x.
>
> Just to clarify, -fno-gcse doesn't get rid of any of the indirect jumps.
> It still has 166 indirect jumps. It just gets rid of the second
> optimization, where the jumptable address is placed in a register.

what about other functions in core.c ?
May be it's easier to teach objtool to recognize that pattern?

> If you have a benchmark which is relatively easy to use, I could try to
> run some tests.

modprobe test_bpf
selftests/bpf/test_progs
both print runtime.
Some of test_progs have high run-to-run variations though.