Re: [tip:x86/urgent] bpf: Fix ORC unwinding in non-JIT BPF code
From: Josh Poimboeuf
Date: Mon Jul 08 2019 - 18:38:48 EST
On Mon, Jul 08, 2019 at 03:15:37PM -0700, Alexei Starovoitov wrote:
> > 2)
> >
> > After doing the first optimization, GCC then does another one which is
> > a little trickier. It replaces:
> >
> > select_insn:
> > jmp *jumptable(, %rax, 8)
> > ...
> > ALU64_ADD_X:
> > ...
> > jmp *jumptable(, %rax, 8)
> > ALU_ADD_X:
> > ...
> > jmp *jumptable(, %rax, 8)
> >
> > with
> >
> > select_insn:
> > mov jumptable, %r12
> > jmp *(%r12, %rax, 8)
> > ...
> > ALU64_ADD_X:
> > ...
> > jmp *(%r12, %rax, 8)
> > ALU_ADD_X:
> > ...
> > jmp *(%r12, %rax, 8)
> >
> > The problem is that it only moves the jumptable address into %r12
> > once, for the entire function, then it goes through multiple recursive
> > indirect jumps which rely on that %r12 value. But objtool isn't yet
> > smart enough to be able to track the value across multiple recursive
> > indirect jumps through the jump table.
> >
> > After some digging I found that the quick and easy fix is to disable
> > -fgcse. In fact, this seems to be recommended by the GCC manual, for
> > code like this:
> >
> > -fgcse
> > Perform a global common subexpression elimination pass. This
> > pass also performs global constant and copy propagation.
> >
> > Note: When compiling a program using computed gotos, a GCC
> > extension, you may get better run-time performance if you
> > disable the global common subexpression elimination pass by
> > adding -fno-gcse to the command line.
> >
> > Enabled at levels -O2, -O3, -Os.
> >
> > This code indeed relies extensively on computed gotos. I don't know
> > *why* disabling this optimization would improve performance. In fact
> > I really don't see how it could make much of a difference either way.
> >
> > Anyway, using -fno-gcse makes optimization #2 go away and makes
> > objtool happy, with only a fix for #1 needed.
> >
> > If -fno-gcse isn't an option, we might be able to fix objtool by using
> > the "first_jump_src" thing which Peter added, improving it such that
> > it also takes table jumps into account.
>
> Sorry for delay. I'm mostly offgrid until next week.
> As far as -fno-gcse.. I don't mind as long as it doesn't hurt performance.
> Which I suspect it will :(
> All these indirect gotos are there for performance.
> Single indirect goto and a bunch of jmp select_insn
> are way slower, since there is only one instruction
> for cpu branch predictor to work with.
> When every insn is followed by "jmp *jumptable"
> there is more room for cpu to speculate.
> It's been long time, but when I wrote it the difference
> between all indirect goto vs single indirect goto was almost 2x.
Just to clarify, -fno-gcse doesn't get rid of any of the indirect jumps.
It still has 166 indirect jumps. It just gets rid of the second
optimization, where the jumptable address is placed in a register.
If you have a benchmark which is relatively easy to use, I could try to
run some tests.
--
Josh