Re: [PATCH bpf v3] bpf: verifier: prevent userspace memory access
From: Ilya Leoshkevich
Date: Thu Mar 21 2024 - 08:16:19 EST
On Thu, 2024-03-21 at 12:08 +0000, Puranjay Mohan wrote:
> With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer.
> To
> thwart invalid memory accesses, the JITs add an exception table entry
> for all such accesses. But in case the src_reg + offset overflows and
> turns into a userspace address, the BPF program might read that
> memory if
> the user has mapped it.
>
> There are architectural features that prevent the kernel from
> accessing
> userspace memory, like Privileged Access Never (PAN) on ARM64,
> Supervisor Mode Access Prevention (SMAP) on x86-64, Supervisor User
> Memory access (SUM) on RISC-V, etc. But BPF should not rely on the
> existence of these features.
>
> Make the verifier add guard instructions around such memory accesses
> and
> skip the load if the address falls into the userspace region.
>
> The JITs need to implement bpf_arch_uaddress_limit() to define where
> the userspace addresses end for that architecture or TASK_SIZE is
> taken
> as default.
>
> The implementation is as follows:
>
> REG_AX = SRC_REG
> if(offset)
> REG_AX += offset;
> REG_AX >>= 32;
> if (REG_AX <= (uaddress_limit >> 32))
> DST_REG = 0;
> else
> DST_REG = *(size *)(SRC_REG + offset);
>
> Comparing just the upper 32 bits of the load address with the upper
> 32 bits of uaddress_limit implies that the values are being aligned
> down
> to a 4GB boundary before comparison.
>
> The above means that all loads with address <= uaddress_limit + 4GB
> are
> skipped. This is acceptable because there is a large hole (much
> larger
> than 4GB) between userspace and kernel space memory, therefore a
> correctly functioning BPF program should not access this 4GB memory
> above the userspace.
>
> Let's analyze what this patch does to the following fentry program
> dereferencing an untrusted pointer:
>
> SEC("fentry/tcp_v4_connect")
> int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk)
> {
> *(volatile long *)sk;
> return 0;
> }
>
> BPF Program before | BPF Program after
> ------------------ | -----------------
>
> 0: (79) r1 = *(u64 *)(r1 +0) 0: (79) r1 = *(u64 *)(r1 +0)
> -------------------------------------------------------------------
> ----
> 1: (79) r1 = *(u64 *)(r1 +0) --\ 1: (bf) r11 = r1
> ----------------------------\ \ 2: (77) r11 >>= 32
> 2: (b7) r0 = 0 \ \ 3: (b5) if r11 <= 0x8000 goto
> pc+2
> 3: (95) exit \ \-> 4: (79) r1 = *(u64 *)(r1 +0)
> \ 5: (05) goto pc+1
> \ 6: (b7) r1 = 0
> \---------------------------------
> -----
> 7: (b7) r0 = 0
> 8: (95) exit
>
> As you can see from above, in the best case (off=0), 5 extra
> instructions
> are emitted.
>
> Now, we analyse the same program after it has gone through the JITs
> of
> X86-64, ARM64, and RISC-V architectures. We follow the single load
> instruction that has the untrusted pointer and see what
> instrumentation
> has been added around it.
>
> x86-64 JIT
> ==========
> JIT's Instrumentation Verifier's
> Instrumentation
> (upstream) (This patch)
> --------------------- -------------------------
> -
>
> 0: nopl 0x0(%rax,%rax,1) 0: nopl
> 0x0(%rax,%rax,1)
> 5: xchg %ax,%ax 5: xchg %ax,%ax
> 7: push %rbp 7: push %rbp
> 8: mov %rsp,%rbp 8: mov %rsp,%rbp
> b: mov 0x0(%rdi),%rdi b: mov
> 0x0(%rdi),%rdi
> -------------------------------------------------------------------
> -----
> f: movabs $0x800000000000,%r11 f: mov %rdi,%r10
> 19: cmp %r11,%rdi 12: shr $0x20,%r10
> 1c: jb 0x000000000000002a 16: cmp $0x8000,%r10
> 1e: mov %rdi,%r11 1d: jbe
> 0x0000000000000025
> 21: add $0x0,%r11 /--> 1f: mov
> 0x0(%rdi),%rdi
> 28: jae 0x000000000000002e / 23: jmp
> 0x0000000000000027
> 2a: xor %edi,%edi / 25: xor %edi,%edi
> 2c: jmp 0x0000000000000032 / /-------------------------------
> -----
> 2e: mov 0x0(%rdi),%rdi ---/ / 27: xor %eax,%eax
> ---------------------------------/ 29: leave
> 32: xor %eax,%eax 2a: ret
> 34: leave
> 35: ret
>
> The x86-64 JIT already emits some instructions to protect against
> user
> memory access. The implementation in this patch leads to a smaller
> number of instructions being emitted. In the worst case the JIT will
> emit 9 extra instructions and this patch decreases it to 7.
>
> ARM64 JIT
> =========
>
> No Intrumentation Verifier's
> Instrumentation
> (upstream) (This patch)
> ----------------- ---------------------
> -----
>
> 0: add x9, x30, #0x0 0: add x9, x30,
> #0x0
> 4: nop 4: nop
> 8: paciasp 8: paciasp
> c: stp x29, x30, [sp, #-16]! c: stp x29, x30,
> [sp, #-16]!
> 10: mov x29, sp 10: mov x29, sp
> 14: stp x19, x20, [sp, #-16]! 14: stp x19, x20,
> [sp, #-16]!
> 18: stp x21, x22, [sp, #-16]! 18: stp x21, x22,
> [sp, #-16]!
> 1c: stp x25, x26, [sp, #-16]! 1c: stp x25, x26,
> [sp, #-16]!
> 20: stp x27, x28, [sp, #-16]! 20: stp x27, x28,
> [sp, #-16]!
> 24: mov x25, sp 24: mov x25, sp
> 28: mov x26, #0x0 28: mov x26, #0x0
> 2c: sub x27, x25, #0x0 2c: sub x27, x25,
> #0x0
> 30: sub sp, sp, #0x0 30: sub sp, sp,
> #0x0
> 34: ldr x0, [x0] 34: ldr x0, [x0]
> ---------------------------------------------------------------------
> -----------
> 38: ldr x0, [x0] ----------\ 38: add x9, x0,
> #0x0
> -----------------------------------\\ 3c: lsr x9, x9, #32
> 3c: mov x7, #0x0 \\ 40: cmp x9, #0x10,
> lsl #12
> 40: mov sp, sp \\ 44: b.ls
> 0x0000000000000050
> 44: ldp x27, x28, [sp], #16 \\--> 48: ldr x0, [x0]
> 48: ldp x25, x26, [sp], #16 \ 4c: b
> 0x0000000000000054
> 4c: ldp x21, x22, [sp], #16 \ 50: mov x0, #0x0
> 50: ldp x19, x20, [sp], #16 \---------------------------
> ------------
> 54: ldp x29, x30, [sp], #16 54: mov x7, #0x0
> 58: add x0, x7, #0x0 58: mov sp, sp
> 5c: autiasp 5c: ldp x27, x28,
> [sp], #16
> 60: ret 60: ldp x25, x26,
> [sp], #16
> 64: nop 64: ldp x21, x22,
> [sp], #16
> 68: ldr x10, 0x0000000000000070 68: ldp x19, x20,
> [sp], #16
> 6c: br x10 6c: ldp x29, x30,
> [sp], #16
> 70: add x0, x7,
> #0x0
> 74: autiasp
> 78: ret
> 7c: nop
> 80: ldr x10,
> 0x0000000000000088
> 84: br x10
>
> There are 6 extra instructions added in ARM64 in the best case. This
> will
> become 7 in the worst case (off != 0).
>
> RISC-V JIT (RISCV_ISA_C Disabled)
> ==========
>
> No Intrumentation Verifier's Instrumentation
> (upstream) (This patch)
> ----------------- --------------------------
>
> 0: nop 0: nop
> 4: nop 4: nop
> 8: li a6, 33 8: li a6, 33
> c: addi sp, sp, -16 c: addi sp, sp, -16
> 10: sd s0, 8(sp) 10: sd s0, 8(sp)
> 14: addi s0, sp, 16 14: addi s0, sp, 16
> 18: ld a0, 0(a0) 18: ld a0, 0(a0)
> ---------------------------------------------------------------
> 1c: ld a0, 0(a0) --\ 1c: mv t0, a0
> --------------------------\ \ 20: srli t0, t0, 32
> 20: li a5, 0 \ \ 24: lui t1, 4096
> 24: ld s0, 8(sp) \ \ 28: sext.w t1, t1
> 28: addi sp, sp, 16 \ \ 2c: bgeu t1, t0, 12
> 2c: sext.w a0, a5 \ \--> 30: ld a0, 0(a0)
> 30: ret \ 34: j 8
> \ 38: li a0, 0
> \------------------------------
> 3c: li a5, 0
> 40: ld s0, 8(sp)
> 44: addi sp, sp, 16
> 48: sext.w a0, a5
> 4c: ret
>
> There are 7 extra instructions added in RISC-V.
>
> Fixes: 800834285361 ("bpf, arm64: Add BPF exception tables")
> Reported-by: Breno Leitao <leitao@xxxxxxxxxx>
> Suggested-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> Signed-off-by: Puranjay Mohan <puranjay12@xxxxxxxxx>
> ---
> V2:
> https://lore.kernel.org/bpf/20240321101058.68530-1-puranjay12@xxxxxxxxx/
> Changes in V3:
> - Return 0 from bpf_arch_uaddress_limit() in disabled case because it
> returns u64.
> - Modify the check in verifier to no do instrumentation when
> uaddress_limit
> is 0.
>
> V1:
> https://lore.kernel.org/bpf/20240320105436.4781-1-puranjay12@xxxxxxxxx/
> Changes in V2:
> - Disable this feature on s390x.
> ---
> arch/s390/net/bpf_jit_comp.c | 5 +++
> arch/x86/net/bpf_jit_comp.c | 72 ++++------------------------------
> --
> include/linux/filter.h | 1 +
> kernel/bpf/core.c | 9 +++++
> kernel/bpf/verifier.c | 30 +++++++++++++++
> 5 files changed, 53 insertions(+), 64 deletions(-)
[...]
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 5aacb1d3c4cc..c131bee33ac3 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2958,6 +2958,15 @@ bool __weak bpf_jit_supports_arena(void)
> return false;
> }
>
> +u64 __weak bpf_arch_uaddress_limit(void)
> +{
> +#ifdef CONFIG_64BIT
> + return TASK_SIZE;
> +#else
> + return 0;
> +#endif
> +}
> +
How about the following?
#if defined(CONFIG_64BIT) && \
defined(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE)
Then we won't need to do anything for s390x explicitly.`
[...]