[PATCH 4.16 099/279] bpf, x64: fix JIT emission for dead code

From: Greg Kroah-Hartman
Date: Mon Jun 18 2018 - 05:51:18 EST

Next message: Greg Kroah-Hartman: "[PATCH 4.16 131/279] IB/rxe: add RXE_START_MASK for rxe_opcode IB_OPCODE_RC_SEND_ONLY_INV"
Previous message: Hugues FRUCHET: "Re: [PATCH 2/2] media: stm32-dcmi: add mandatory of_node_put() in success path"
In reply to: Greg Kroah-Hartman: "[PATCH 4.16 093/279] arm64: ptrace: remove addr_limit manipulation"
Next in thread: Greg Kroah-Hartman: "[PATCH 4.16 131/279] IB/rxe: add RXE_START_MASK for rxe_opcode IB_OPCODE_RC_SEND_ONLY_INV"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

4.16-stable review patch. If anyone has any objections, please let me know.

------------------

From: Gianluca Borello <g.borello@xxxxxxxxx>

[ Upstream commit 1612a981b76688c598dc944bbfbe29a2b33e3973 ]

Commit 2a5418a13fcf ("bpf: improve dead code sanitizing") replaced dead
code with a series of ja-1 instructions, for safety. That made JIT
compilation much more complex for some BPF programs. One instance of such
programs is, for example:

bool flag = false
...
/* A bunch of other code */
...
if (flag)
do_something()

In some cases llvm is not able to remove at compile time the code for
do_something(), so the generated BPF program ends up with a large amount
of dead instructions. In one specific real life example, there are two
series of ~500 and ~1000 dead instructions in the program. When the
verifier replaces them with a series of ja-1 instructions, it causes an
interesting behavior at JIT time.

During the first pass, since all the instructions are estimated at 64
bytes, the ja-1 instructions end up being translated as 5 bytes JMP
instructions (0xE9), since the jump offsets become increasingly large (>
127) as each instruction gets discovered to be 5 bytes instead of the
estimated 64.

Starting from the second pass, the first N instructions of the ja-1
sequence get translated into 2 bytes JMPs (0xEB) because the jump offsets
become <= 127 this time. In particular, N is defined as roughly 127 / (5
- 2) ~= 42. So, each further pass will make the subsequent N JMP
instructions shrink from 5 to 2 bytes, making the image shrink every time.
This means that in order to have the entire program converge, there need
to be, in the real example above, at least ~1000 / 42 ~= 24 passes just
for translating the dead code. If we add this number to the passes needed
to translate the other non dead code, it brings such program to 40+
passes, and JIT doesn't complete. Ultimately the userspace loader fails
because such BPF program was supposed to be part of a prog array owner
being JITed.

While it is certainly possible to try to refactor such programs to help
the compiler remove dead code, the behavior is not really intuitive and it
puts further burden on the BPF developer who is not expecting such
behavior. To make things worse, such programs are working just fine in all
the kernel releases prior to the ja-1 fix.

A possible approach to mitigate this behavior consists into noticing that
for ja-1 instructions we don't really need to rely on the estimated size
of the previous and current instructions, we know that a -1 BPF jump
offset can be safely translated into a 0xEB instruction with a jump offset
of -2.

Such fix brings the BPF program in the previous example to complete again
in ~9 passes.

Fixes: 2a5418a13fcf ("bpf: improve dead code sanitizing")
Signed-off-by: Gianluca Borello <g.borello@xxxxxxxxx>
Acked-by: Alexei Starovoitov <ast@xxxxxxxxxx>
Signed-off-by: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
Signed-off-by: Sasha Levin <alexander.levin@xxxxxxxxxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
---
arch/x86/net/bpf_jit_comp.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -992,7 +992,17 @@ emit_cond_jmp: /* convert BPF opcode to
break;

case BPF_JMP | BPF_JA:
- jmp_offset = addrs[i + insn->off] - addrs[i];
+ if (insn->off == -1)
+ /* -1 jmp instructions will always jump
+ * backwards two bytes. Explicitly handling
+ * this case avoids wasting too many passes
+ * when there are long sequences of replaced
+ * dead code.
+ */
+ jmp_offset = -2;
+ else
+ jmp_offset = addrs[i + insn->off] - addrs[i];
+
if (!jmp_offset)
/* optimize out nop jumps */
break;

Next message: Greg Kroah-Hartman: "[PATCH 4.16 131/279] IB/rxe: add RXE_START_MASK for rxe_opcode IB_OPCODE_RC_SEND_ONLY_INV"
Previous message: Hugues FRUCHET: "Re: [PATCH 2/2] media: stm32-dcmi: add mandatory of_node_put() in success path"
In reply to: Greg Kroah-Hartman: "[PATCH 4.16 093/279] arm64: ptrace: remove addr_limit manipulation"
Next in thread: Greg Kroah-Hartman: "[PATCH 4.16 131/279] IB/rxe: add RXE_START_MASK for rxe_opcode IB_OPCODE_RC_SEND_ONLY_INV"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]