Re: [PATCH 5/5] MIPS: Add support for eBPF JIT.

From: David Daney
Date: Fri May 26 2017 - 12:12:07 EST


On 05/25/2017 07:23 PM, Alexei Starovoitov wrote:
On Thu, May 25, 2017 at 05:38:26PM -0700, David Daney wrote:
Since the eBPF machine has 64-bit registers, we only support this in
64-bit kernels. As of the writing of this commit log test-bpf is showing:

test_bpf: Summary: 316 PASSED, 0 FAILED, [308/308 JIT'ed]

All current test cases are successfully compiled.

Signed-off-by: David Daney <david.daney@xxxxxxxxxx>
---
arch/mips/Kconfig | 1 +
arch/mips/net/bpf_jit.c | 1627 ++++++++++++++++++++++++++++++++++++++++++++++-
arch/mips/net/bpf_jit.h | 7 +
3 files changed, 1633 insertions(+), 2 deletions(-)

Great stuff. I wonder what the performance difference is,
interpreter vs JIT.

It depends on whether we are calling library code:

/proc/sys/net/core # echo 0 > bpf_jit_enable
/proc/sys/net/core # modprobe test-bpf test_id=275
test_bpf: #275 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:0 131733 PASS
test_bpf: Summary: 1 PASSED, 0 FAILED, [0/1 JIT'ed]
/proc/sys/net/core # rmmod test-bpf
/proc/sys/net/core # echo 1 > bpf_jit_enable
/proc/sys/net/core # modprobe test-bpf test_id=275
test_bpf: #275 BPF_MAXINSNS: ld_abs+vlan_push/pop jited:1 85453 PASS
test_bpf: Summary: 1 PASSED, 0 FAILED, [1/1 JIT'ed]

About 1.5X faster (131733 / 85453 = ~1.54).

Or doing atomic operations:

/proc/sys/net/core # rmmod test-bpf
/proc/sys/net/core # echo 0 > bpf_jit_enable
/proc/sys/net/core # modprobe test-bpf test_id=229
test_bpf: #229 STX_XADD_DW: X + 1 + 1 + 1 + ... jited:0 209020 PASS
test_bpf: Summary: 1 PASSED, 0 FAILED, [0/1 JIT'ed]
/proc/sys/net/core # rmmod test-bpf
/proc/sys/net/core # echo 1 > bpf_jit_enable
/proc/sys/net/core # modprobe test-bpf test_id=229
test_bpf: #229 STX_XADD_DW: X + 1 + 1 + 1 + ... jited:1 158004 PASS
test_bpf: Summary: 1 PASSED, 0 FAILED, [1/1 JIT'ed]

About 1.3X faster (209020 / 158004 = ~1.32), probably limited more by the coherent memory system than by code quality.

Simple register operations not touching memory are best:
/proc/sys/net/core # rmmod test-bpf
/proc/sys/net/core # echo 0 > bpf_jit_enable
/proc/sys/net/core # modprobe test-bpf test_id=38
test_bpf: #38 INT: ADD 64-bit jited:0 1819 PASS
test_bpf: Summary: 1 PASSED, 0 FAILED, [0/1 JIT'ed]
/proc/sys/net/core # rmmod test-bpf
/proc/sys/net/core # echo 1 > bpf_jit_enable
/proc/sys/net/core # modprobe test-bpf test_id=38
test_bpf: #38 INT: ADD 64-bit jited:1 83 PASS
test_bpf: Summary: 1 PASSED, 0 FAILED, [1/1 JIT'ed]

This one is fairly good: 21X faster (1819 / 83 = ~21.9).



+ * eBPF stack frame will be something like:
+ *
+ *  Entry $sp ------>   +--------------------------------+
+ *                      |   $ra  (optional)              |
+ *                      +--------------------------------+
+ *                      |   $s0  (optional)              |
+ *                      +--------------------------------+
+ *                      |   $s1  (optional)              |
+ *                      +--------------------------------+
+ *                      |   $s2  (optional)              |
+ *                      +--------------------------------+
+ *                      |   $s3  (optional)              |
+ *                      +--------------------------------+
+ *                      | tmp-storage (if $ra saved)     |
+ * $sp + tmp_offset --> +--------------------------------+ <--BPF_REG_10
+ *                      | BPF_REG_10 relative storage    |
+ *                      | MAX_BPF_STACK (optional)       |
+ *                      | .                              |
+ *                      | .                              |
+ *                      | .                              |
+ *     $sp -------->    +--------------------------------+
+ *
+ * If BPF_REG_10 is never referenced, then the MAX_BPF_STACK sized
+ * area is not allocated.
+ */
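
Just to spell out how that layout adds up, something along these lines gives the
frame size. This is purely illustrative, not code from the patch: the function
name, the flags and the 8-byte / 16-byte slot sizes are made up for the example;
only MAX_BPF_STACK (from <linux/filter.h>) is real:

#include <linux/filter.h>	/* MAX_BPF_STACK */

/* Hypothetical sketch of how the optional pieces of the frame sum up. */
static int ebpf_frame_size(bool save_ra, int num_saved_s_regs,
			   bool r10_referenced)
{
	int size = 0;

	if (save_ra)
		size += 8;			/* $ra save slot */
	size += num_saved_s_regs * 8;		/* $s0..$s3 save slots */
	if (save_ra)
		size += 16;			/* tmp-storage */
	if (r10_referenced)
		size += MAX_BPF_STACK;		/* BPF_REG_10 relative area */

	return size;				/* aligned as required */
}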

It's especially great to see that you've put the tmp storage
above the program stack and made the stack allocation optional.
At the moment I'm working on reducing the bpf program stack size,
so that the JIT and interpreter can use only the stack they need.
Looking at this JIT code, only minimal changes will be needed.


I originally recorded the minimum and maximum offsets from BPF_REG_10 that were seen, and generated a minimally sized stack frame. Then I saw things like:

	{
		"STX_XADD_DW: Test side-effects, r10: 0x12 + 0x10 = 0x22",
		.u.insns_int = {
			BPF_ALU64_REG(BPF_MOV, R1, R10),
			BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
			BPF_ST_MEM(BPF_DW, R10, -40, 0x10),
			BPF_STX_XADD(BPF_DW, R10, R0, -40),
			BPF_ALU64_REG(BPF_MOV, R0, R10),
			BPF_ALU64_REG(BPF_SUB, R0, R1),
			BPF_EXIT_INSN(),
		},
		INTERNAL,
		{ },
		{ { 0, 0 } },
	},

Here we see that the value of BPF_REG_10 can escape and be used for who knows what, so we must assume the worst case.

I guess we could check whether the BPF_REG_10 value ever escapes; if it doesn't, use an optimally sized stack frame, and fall back to MAX_BPF_STACK only when we cannot prove that is safe. Something along the lines of the sketch below could do that check.
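
A rough sketch of the idea, nothing more. The function name and structure are
made up (not part of the patch); it assumes the usual struct bpf_insn and BPF_*
macros from <linux/filter.h>:

#include <linux/filter.h>

/*
 * Return true if the value in BPF_REG_10 can leave the register, i.e.
 * be copied by an ALU op or stored to memory.  Using R10 only as the
 * base register of a load/store is not an escape.
 */
static bool bpf_r10_escapes(const struct bpf_insn *insns, int len)
{
	int i;

	for (i = 0; i < len; i++) {
		const struct bpf_insn *insn = &insns[i];

		switch (BPF_CLASS(insn->code)) {
		case BPF_ALU:
		case BPF_ALU64:
			/* e.g. BPF_ALU64_REG(BPF_MOV, R1, R10) in the test above */
			if (BPF_SRC(insn->code) == BPF_X &&
			    insn->src_reg == BPF_REG_10)
				return true;
			break;
		case BPF_STX:
			/* The value in R10 itself being stored to memory. */
			if (insn->src_reg == BPF_REG_10)
				return true;
			break;
		default:
			/*
			 * LDX/ST use R10 only as an address base; anything
			 * unexpected can be treated conservatively.
			 */
			break;
		}
	}
	return false;
}

Any program where a check like this trips would keep the MAX_BPF_STACK sized
area, exactly as in the worst case above.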