Re: [PATCH net-next 8/9] net: filter: rework/optimize internal BPF interpreter's instruction set
From: Kees Cook
Date: Fri Mar 21 2014 - 11:40:48 EST
On Fri, Mar 21, 2014 at 6:20 AM, Daniel Borkmann <dborkman@xxxxxxxxxx> wrote:
> From: Alexei Starovoitov <ast@xxxxxxxxxxxx>
>
> This patch replaces/reworks the kernel-internal BPF interpreter with
> an optimized BPF instruction set format that is modelled more closely
> on native instruction sets and is designed to be JITed with a
> one-to-one mapping. Thus, the new interpreter is noticeably faster than
> the current implementation of sk_run_filter(), mainly for two reasons:
>
> 1. Fall-through jumps:
>
> BPF jump instructions are forced to take either the 'true' or the
> 'false' branch, which causes a branch-miss penalty. The new BPF jump
> instructions have only one branch and fall through otherwise,
> which fits the CPU branch predictor logic better. `perf stat`
> shows a drastic difference in branch-misses between the old and
> new code.
>
> 2. Jump-threaded implementation of interpreter vs switch
> statement:
>
> Instead of a single tablejump at the top of the 'switch' statement,
> gcc will now generate multiple tablejump instructions, which
> helps the CPU branch predictor logic.
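
A minimal sketch of the jump-threaded dispatch described above, not the
patch's interpreter. The three-opcode instruction set (OP_ADD, OP_JNZ,
OP_RET), struct toy_insn and run_threaded() are hypothetical names made
up for illustration. The patch writes a single 'goto *jumptable[...]'
and leaves the duplication to gcc; here the indirect jump is spelled out
by hand in every handler to make the effect visible. It relies on the
GNU C "labels as values" extension, so it assumes gcc or clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes, not real BPF opcodes. */
enum { OP_ADD, OP_JNZ, OP_RET, OP_MAX };

struct toy_insn {
	uint8_t code;	/* opcode, used as index into the jump table */
	int16_t off;	/* signed relative jump offset (OP_JNZ only) */
	int32_t imm;	/* immediate operand (OP_ADD only) */
};

/* Runs a trusted toy program; opcodes are not range-checked here.
 * Every handler ends with its own indirect jump, so the CPU sees one
 * branch site per opcode instead of one shared site.
 */
static int64_t run_threaded(const struct toy_insn *insn)
{
	static const void *jumptable[OP_MAX] = {
		[OP_ADD] = &&op_add,
		[OP_JNZ] = &&op_jnz,
		[OP_RET] = &&op_ret,
	};
	int64_t acc = 0;

	assert(insn != NULL);

#define DISPATCH() goto *jumptable[insn->code]

	DISPATCH();

op_add:
	acc += insn->imm;
	insn++;
	DISPATCH();
op_jnz:
	/* One jump target; the 'false' case just falls through. */
	if (acc != 0)
		insn += insn->off;
	insn++;
	DISPATCH();
op_ret:
	return acc;

#undef DISPATCH
}
```

With a program of "add 5; jnz +1; add 100; add 2; ret", the jump skips
the 'add 100' and the result is 7; starting from 'add 0' instead, the
jump falls through and the result is 102.
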
>
> In short, the internal format extends BPF in the following way (more
> details can be taken from the appended documentation):
>
> - Number of registers increases from 2 to 10
> - Register width increases from 32-bit to 64-bit
> - Conditional jt/jf targets replaced with jt/fall-through,
> and forward/backward jumps now possible as well
> - Adds signed > and >= insns
> - 16 4-byte stack slots for register spill-fill replaced
> with up to 512 bytes of multi-use stack space
> - Introduction of bpf_call insn and register passing convention
> for zero overhead calls from/to other kernel functions
> - Adds arithmetic right shift insn
> - Adds swab insns for 32/64-bit
> - Adds atomic_add insn
> - Old tax/txa insns are replaced with 'mov dst,src' insn
>
> Note that the verification of filters is still being done through
> sk_chk_filter(), so filters from user- or kernel space are verified
> in the same way as we do now. We reuse current BPF JIT compilers
> in a way that this upgrade would even be fine as is, but nevertheless
> allows for a successive upgrade of BPF JIT compilers to the new
> format. The internal instruction set migration is being done after
> the probing for JIT compilation, so in case JIT compilers are able
> to create a native opcode image, we're going to use that, and in all
> other cases we're doing a follow-up migration of the BPF program's
> instruction set, so that it can be transparently run in the new
> interpreter.
>
> Performance of two BPF filters generated by libpcap resp. bpf_asm
> was measured on x86_64, i386 and arm32 (other libpcap programs
> have similar performance differences):
>
> fprog #1 is taken from Documentation/networking/filter.txt:
> tcpdump -i eth0 port 22 -dd
>
> fprog #2 is taken from 'man tcpdump':
> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
> ((tcp[12]&0xf0)>>2)) != 0)' -dd
>
> Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
> same SKB (cache-hit) or 10k SKBs (cache-miss); time in nsec per
> call, smaller is better:
>
> --x86_64--
>               fprog #1   fprog #1    fprog #2   fprog #2
>               cache-hit  cache-miss  cache-hit  cache-miss
> old BPF          90        101         192        202
> new BPF          31         71          47         97
> old BPF jit      12         34          17         44
> new BPF jit     TBD
>
> --i386--
>               fprog #1   fprog #1    fprog #2   fprog #2
>               cache-hit  cache-miss  cache-hit  cache-miss
> old BPF         107        136         227        252
> new BPF          40        119          69        172
>
> --arm32--
>               fprog #1   fprog #1    fprog #2   fprog #2
>               cache-hit  cache-miss  cache-hit  cache-miss
> old BPF         202        300         475        540
> new BPF         180        270         330        470
> old BPF jit      26        182          37        202
> new BPF jit     TBD
>
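
For reading the tables above, the relative gain is just old time divided
by new time. A tiny helper, with a hypothetical name, assuming the
numbers are nsec per call as stated:

```c
#include <assert.h>

/* Speedup factor of the new interpreter over the old one; both
 * arguments are times in nsec per call, smaller is better.
 */
static double speedup(double old_ns, double new_ns)
{
	assert(new_ns > 0.0);
	return old_ns / new_ns;
}
```

Using the quoted cache-hit figures for fprog #1, speedup(90, 31) is
roughly 2.9 on x86_64, while speedup(202, 180) is roughly 1.1 on arm32.
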
> Thus, without changing any userland BPF filters, applications on
> top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
> classifier, netfilter's xt_bpf, team driver's load-balancing mode,
> and many more will have better interpreter filtering performance.
>
> While we are replacing the internal BPF interpreter, we also need
> to convert seccomp BPF in the same step to make use of the new
> internal structure since it makes use of lower-level API details
> without being further decoupled through higher-level calls like
> sk_unattached_filter_{create,destroy}(), for example.
>
> Just as with normal socket filtering, seccomp BPF also experiences
> a time-to-verdict speedup:
>
> 05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
>
> seccomp_rule_add_exact(ctx,...
> seccomp_rule_add_exact(ctx,...
>
> rc = seccomp_load(ctx);
>
> for (i = 0; i < 10000000; i++)
> syscall(199, 100);
>
> 'short filter' has 2 rules
> 'large filter' has 200 rules
>
> 'short filter' performance is slightly better on x86_64/i386/arm32
> 'large filter' is much faster on x86_64 and i386 and shows no
> difference on arm32
>
> --x86_64-- short filter
> old BPF: 2.7 sec
> 39.12% bench libc-2.15.so [.] syscall
> 8.10% bench [kernel.kallsyms] [k] sk_run_filter
> 6.31% bench [kernel.kallsyms] [k] system_call
> 5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
> 4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
> 3.70% bench [kernel.kallsyms] [k] __secure_computing
> 3.67% bench [kernel.kallsyms] [k] lock_is_held
> 3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
> new BPF: 2.58 sec
> 42.05% bench libc-2.15.so [.] syscall
> 6.91% bench [kernel.kallsyms] [k] system_call
> 6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
> 6.07% bench [kernel.kallsyms] [k] __secure_computing
> 5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
>
> --arm32-- short filter
> old BPF: 4.0 sec
> 39.92% bench [kernel.kallsyms] [k] vector_swi
> 16.60% bench [kernel.kallsyms] [k] sk_run_filter
> 14.66% bench libc-2.17.so [.] syscall
> 5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
> 5.10% bench [kernel.kallsyms] [k] __secure_computing
> new BPF: 3.7 sec
> 35.93% bench [kernel.kallsyms] [k] vector_swi
> 21.89% bench libc-2.17.so [.] syscall
> 13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
> 6.25% bench [kernel.kallsyms] [k] __secure_computing
> 3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
>
> --x86_64-- large filter
> old BPF: 8.6 seconds
> 73.38% bench [kernel.kallsyms] [k] sk_run_filter
> 10.70% bench libc-2.15.so [.] syscall
> 5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
> 1.97% bench [kernel.kallsyms] [k] system_call
> new BPF: 5.7 seconds
> 66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
> 16.75% bench libc-2.15.so [.] syscall
> 3.31% bench [kernel.kallsyms] [k] system_call
> 2.88% bench [kernel.kallsyms] [k] __secure_computing
>
> --i386-- large filter
> old BPF: 5.4 sec
> new BPF: 3.8 sec
>
> --arm32-- large filter
> old BPF: 13.5 sec
> 73.88% bench [kernel.kallsyms] [k] sk_run_filter
> 10.29% bench [kernel.kallsyms] [k] vector_swi
> 6.46% bench libc-2.17.so [.] syscall
> 2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
> 1.19% bench [kernel.kallsyms] [k] __secure_computing
> 0.87% bench [kernel.kallsyms] [k] sys_getuid
> new BPF: 13.5 sec
> 76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
> 10.98% bench [kernel.kallsyms] [k] vector_swi
> 5.87% bench libc-2.17.so [.] syscall
> 1.77% bench [kernel.kallsyms] [k] __secure_computing
> 0.93% bench [kernel.kallsyms] [k] sys_getuid
>
> BPF filters generated by seccomp are very branchy, so the new
> internal BPF performance is better than the old one. Performance
> gains will be even higher when BPF JIT is committed for the
> new structure, which is planned in future work (as successive
> JIT migrations).
>
> BPF has also been stress-tested with trinity's BPF fuzzer.
>
> Joint work with Daniel Borkmann.
>
> References: http://thread.gmane.org/gmane.linux.kernel/1665858
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxxxx>
> Signed-off-by: Daniel Borkmann <dborkman@xxxxxxxxxx>
> Cc: Hagen Paul Pfeifer <hagen@xxxxxxxx>
> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
> Cc: Paul Moore <pmoore@xxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: H. Peter Anvin <hpa@xxxxxxxxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx
This looks great, thanks for all the seccomp testing!
Acked-by: Kees Cook <keescook@xxxxxxxxxxxx>
-Kees
> ---
> v1 -> v10 history at:
> - http://thread.gmane.org/gmane.linux.kernel/1665858
>
> include/linux/filter.h | 66 ++-
> include/linux/seccomp.h | 1 -
> kernel/seccomp.c | 119 ++--
> net/core/filter.c | 1415 +++++++++++++++++++++++++++++++++++++----------
> 4 files changed, 1229 insertions(+), 372 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 9bde3ed..3ea12fa 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -9,13 +9,50 @@
> #include <linux/workqueue.h>
> #include <uapi/linux/filter.h>
>
> -#ifdef CONFIG_COMPAT
> -/*
> - * A struct sock_filter is architecture independent.
> +/* Internally used and optimized filter representation with extended
> + * instruction set based on top of classic BPF.
> */
> +
> +/* instruction classes */
> +#define BPF_ALU64 0x07 /* alu mode in double word width */
> +
> +/* ld/ldx fields */
> +#define BPF_DW 0x18 /* double word */
> +#define BPF_XADD 0xc0 /* exclusive add */
> +
> +/* alu/jmp fields */
> +#define BPF_MOV 0xb0 /* mov reg to reg */
> +#define BPF_ARSH 0xc0 /* sign extending arithmetic shift right */
> +#define BPF_BSWAP 0xd0 /* swap 4 or 8 bytes of 64-bit register */
> +
> +#define BPF_JNE 0x50 /* jump != */
> +#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
> +#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
> +#define BPF_CALL 0x80 /* function call */
> +
> +/* BPF has 10 general purpose 64-bit registers and stack frame. */
> +#define MAX_BPF_REG 11
> +
> +/* BPF program can access up to 512 bytes of stack space. */
> +#define MAX_BPF_STACK 512
> +
> +/* Context and stack frame pointer register positions. */
> +#define CTX_REG 1
> +#define FP_REG 10
> +
> +struct sock_filter_int {
> + __u8 code; /* opcode */
> + __u8 a_reg:4; /* dest register */
> + __u8 x_reg:4; /* source register */
> + __s16 off; /* signed offset */
> + __s32 imm; /* signed immediate constant */
> +};
> +
> +#ifdef CONFIG_COMPAT
> +/* A struct sock_filter is architecture independent. */
> struct compat_sock_fprog {
> u16 len;
> - compat_uptr_t filter; /* struct sock_filter * */
> + compat_uptr_t filter; /* struct sock_filter * */
> };
> #endif
>
> @@ -26,6 +63,7 @@ struct sock_fprog_kern {
>
> struct sk_buff;
> struct sock;
> +struct seccomp_data;
>
> struct sk_filter {
> atomic_t refcnt;
> @@ -34,9 +72,10 @@ struct sk_filter {
> struct sock_fprog_kern *orig_prog; /* Original BPF program */
> struct rcu_head rcu;
> unsigned int (*bpf_func)(const struct sk_buff *skb,
> - const struct sock_filter *filter);
> + const struct sock_filter_int *filter);
> union {
> - struct sock_filter insns[0];
> + struct sock_filter insns[0];
> + struct sock_filter_int insnsi[0];
> struct work_struct work;
> };
> };
> @@ -50,9 +89,18 @@ static inline unsigned int sk_filter_size(unsigned int proglen)
> #define sk_filter_proglen(fprog) \
> (fprog->len * sizeof(fprog->filter[0]))
>
> +#define SK_RUN_FILTER(filter, ctx) \
> + (*filter->bpf_func)(ctx, filter->insnsi)
> +
> int sk_filter(struct sock *sk, struct sk_buff *skb);
> -unsigned int sk_run_filter(const struct sk_buff *skb,
> - const struct sock_filter *filter);
> +
> +u32 sk_run_filter_int_seccomp(const struct seccomp_data *ctx,
> + const struct sock_filter_int *insni);
> +u32 sk_run_filter_int_skb(const struct sk_buff *ctx,
> + const struct sock_filter_int *insni);
> +
> +int sk_convert_filter(struct sock_filter *prog, int len,
> + struct sock_filter_int *new_prog, int *new_len);
>
> int sk_unattached_filter_create(struct sk_filter **pfp,
> struct sock_fprog *fprog);
> @@ -86,7 +134,6 @@ static inline void bpf_jit_dump(unsigned int flen, unsigned int proglen,
> print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
> 16, 1, image, proglen, false);
> }
> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB, FILTER->insns)
> #else
> #include <linux/slab.h>
> static inline void bpf_jit_compile(struct sk_filter *fp)
> @@ -96,7 +143,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
> {
> kfree(fp);
> }
> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
> #endif
>
> static inline int bpf_tell_extensions(void)
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 6f19cfd..4054b09 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -76,7 +76,6 @@ static inline int seccomp_mode(struct seccomp *s)
> #ifdef CONFIG_SECCOMP_FILTER
> extern void put_seccomp_filter(struct task_struct *tsk);
> extern void get_seccomp_filter(struct task_struct *tsk);
> -extern u32 seccomp_bpf_load(int off);
> #else /* CONFIG_SECCOMP_FILTER */
> static inline void put_seccomp_filter(struct task_struct *tsk)
> {
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b7a1004..4f18e75 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -55,60 +55,33 @@ struct seccomp_filter {
> atomic_t usage;
> struct seccomp_filter *prev;
> unsigned short len; /* Instruction count */
> - struct sock_filter insns[];
> + struct sock_filter_int insnsi[];
> };
>
> /* Limit any path through the tree to 256KB worth of instructions. */
> #define MAX_INSNS_PER_PATH ((1 << 18) / sizeof(struct sock_filter))
>
> -/**
> - * get_u32 - returns a u32 offset into data
> - * @data: a unsigned 64 bit value
> - * @index: 0 or 1 to return the first or second 32-bits
> - *
> - * This inline exists to hide the length of unsigned long. If a 32-bit
> - * unsigned long is passed in, it will be extended and the top 32-bits will be
> - * 0. If it is a 64-bit unsigned long, then whatever data is resident will be
> - * properly returned.
> - *
> +/*
> * Endianness is explicitly ignored and left for BPF program authors to manage
> * as per the specific architecture.
> */
> -static inline u32 get_u32(u64 data, int index)
> +static void populate_seccomp_data(struct seccomp_data *sd)
> {
> - return ((u32 *)&data)[index];
> -}
> + struct task_struct *task = current;
> + struct pt_regs *regs = task_pt_regs(task);
>
> -/* Helper for bpf_load below. */
> -#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
> -/**
> - * bpf_load: checks and returns a pointer to the requested offset
> - * @off: offset into struct seccomp_data to load from
> - *
> - * Returns the requested 32-bits of data.
> - * seccomp_check_filter() should assure that @off is 32-bit aligned
> - * and not out of bounds. Failure to do so is a BUG.
> - */
> -u32 seccomp_bpf_load(int off)
> -{
> - struct pt_regs *regs = task_pt_regs(current);
> - if (off == BPF_DATA(nr))
> - return syscall_get_nr(current, regs);
> - if (off == BPF_DATA(arch))
> - return syscall_get_arch(current, regs);
> - if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
> - unsigned long value;
> - int arg = (off - BPF_DATA(args[0])) / sizeof(u64);
> - int index = !!(off % sizeof(u64));
> - syscall_get_arguments(current, regs, arg, 1, &value);
> - return get_u32(value, index);
> - }
> - if (off == BPF_DATA(instruction_pointer))
> - return get_u32(KSTK_EIP(current), 0);
> - if (off == BPF_DATA(instruction_pointer) + sizeof(u32))
> - return get_u32(KSTK_EIP(current), 1);
> - /* seccomp_check_filter should make this impossible. */
> - BUG();
> + sd->nr = syscall_get_nr(task, regs);
> + sd->arch = syscall_get_arch(task, regs);
> +
> + /* Unroll syscall_get_args to help gcc on arm. */
> + syscall_get_arguments(task, regs, 0, 1, (unsigned long *) &sd->args[0]);
> + syscall_get_arguments(task, regs, 1, 1, (unsigned long *) &sd->args[1]);
> + syscall_get_arguments(task, regs, 2, 1, (unsigned long *) &sd->args[2]);
> + syscall_get_arguments(task, regs, 3, 1, (unsigned long *) &sd->args[3]);
> + syscall_get_arguments(task, regs, 4, 1, (unsigned long *) &sd->args[4]);
> + syscall_get_arguments(task, regs, 5, 1, (unsigned long *) &sd->args[5]);
> +
> + sd->instruction_pointer = KSTK_EIP(task);
> }
>
> /**
> @@ -133,17 +106,17 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
>
> switch (code) {
> case BPF_S_LD_W_ABS:
> - ftest->code = BPF_S_ANC_SECCOMP_LD_W;
> + ftest->code = BPF_LDX | BPF_W | BPF_ABS;
> /* 32-bit aligned and not out of bounds. */
> if (k >= sizeof(struct seccomp_data) || k & 3)
> return -EINVAL;
> continue;
> case BPF_S_LD_W_LEN:
> - ftest->code = BPF_S_LD_IMM;
> + ftest->code = BPF_LD | BPF_IMM;
> ftest->k = sizeof(struct seccomp_data);
> continue;
> case BPF_S_LDX_W_LEN:
> - ftest->code = BPF_S_LDX_IMM;
> + ftest->code = BPF_LDX | BPF_IMM;
> ftest->k = sizeof(struct seccomp_data);
> continue;
> /* Explicitly include allowed calls. */
> @@ -185,6 +158,7 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
> case BPF_S_JMP_JGT_X:
> case BPF_S_JMP_JSET_K:
> case BPF_S_JMP_JSET_X:
> + sk_decode_filter(ftest, ftest);
> continue;
> default:
> return -EINVAL;
> @@ -202,18 +176,21 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
> static u32 seccomp_run_filters(int syscall)
> {
> struct seccomp_filter *f;
> + struct seccomp_data sd;
> u32 ret = SECCOMP_RET_ALLOW;
>
> /* Ensure unexpected behavior doesn't result in failing open. */
> if (WARN_ON(current->seccomp.filter == NULL))
> return SECCOMP_RET_KILL;
>
> + populate_seccomp_data(&sd);
> +
> /*
> * All filters in the list are evaluated and the lowest BPF return
> * value always takes priority (ignoring the DATA).
> */
> for (f = current->seccomp.filter; f; f = f->prev) {
> - u32 cur_ret = sk_run_filter(NULL, f->insns);
> + u32 cur_ret = sk_run_filter_int_seccomp(&sd, f->insnsi);
> if ((cur_ret & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
> ret = cur_ret;
> }
> @@ -231,6 +208,8 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
> struct seccomp_filter *filter;
> unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
> unsigned long total_insns = fprog->len;
> + struct sock_filter *fp;
> + int new_len;
> long ret;
>
> if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
> @@ -252,28 +231,43 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
> CAP_SYS_ADMIN) != 0)
> return -EACCES;
>
> - /* Allocate a new seccomp_filter */
> - filter = kzalloc(sizeof(struct seccomp_filter) + fp_size,
> - GFP_KERNEL|__GFP_NOWARN);
> - if (!filter)
> + fp = kzalloc(fp_size, GFP_KERNEL|__GFP_NOWARN);
> + if (!fp)
> return -ENOMEM;
> - atomic_set(&filter->usage, 1);
> - filter->len = fprog->len;
>
> /* Copy the instructions from fprog. */
> ret = -EFAULT;
> - if (copy_from_user(filter->insns, fprog->filter, fp_size))
> - goto fail;
> + if (copy_from_user(fp, fprog->filter, fp_size))
> + goto free_prog;
>
> /* Check and rewrite the fprog via the skb checker */
> - ret = sk_chk_filter(filter->insns, filter->len);
> + ret = sk_chk_filter(fp, fprog->len);
> if (ret)
> - goto fail;
> + goto free_prog;
>
> /* Check and rewrite the fprog for seccomp use */
> - ret = seccomp_check_filter(filter->insns, filter->len);
> + ret = seccomp_check_filter(fp, fprog->len);
> + if (ret)
> + goto free_prog;
> +
> + /* Convert 'sock_filter' insns to 'sock_filter_int' insns */
> + ret = sk_convert_filter(fp, fprog->len, NULL, &new_len);
> + if (ret)
> + goto free_prog;
> +
> + /* Allocate a new seccomp_filter */
> + filter = kzalloc(sizeof(struct seccomp_filter) +
> + sizeof(struct sock_filter_int) * new_len,
> + GFP_KERNEL|__GFP_NOWARN);
> + if (!filter)
> + goto free_prog;
> +
> + ret = sk_convert_filter(fp, fprog->len, filter->insnsi, &new_len);
> if (ret)
> - goto fail;
> + goto free_filter;
> +
> + atomic_set(&filter->usage, 1);
> + filter->len = new_len;
>
> /*
> * If there is an existing filter, make it the prev and don't drop its
> @@ -282,8 +276,11 @@ static long seccomp_attach_filter(struct sock_fprog *fprog)
> filter->prev = current->seccomp.filter;
> current->seccomp.filter = filter;
> return 0;
> -fail:
> +
> +free_filter:
> kfree(filter);
> +free_prog:
> + kfree(fp);
> return ret;
> }
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 976edc6..683f1e8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1,11 +1,16 @@
> /*
> * Linux Socket Filter - Kernel level socket filtering
> *
> - * Author:
> - * Jay Schulist <jschlst@xxxxxxxxx>
> + * Based on the design of the Berkeley Packet Filter. The new
> + * internal format has been designed by PLUMgrid:
> *
> - * Based on the design of:
> - * - The Berkeley Packet Filter
> + * Copyright (c) 2011 - 2014 PLUMgrid, http://plumgrid.com
> + *
> + * Authors:
> + *
> + * Jay Schulist <jschlst@xxxxxxxxx>
> + * Alexei Starovoitov <ast@xxxxxxxxxxxx>
> + * Daniel Borkmann <dborkman@xxxxxxxxxx>
> *
> * This program is free software; you can redistribute it and/or
> * modify it under the terms of the GNU General Public License
> @@ -35,6 +40,7 @@
> #include <linux/timer.h>
> #include <asm/uaccess.h>
> #include <asm/unaligned.h>
> +#include <asm/byteorder.h>
> #include <linux/filter.h>
> #include <linux/ratelimit.h>
> #include <linux/seccomp.h>
> @@ -108,304 +114,1002 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
> }
> EXPORT_SYMBOL(sk_filter);
>
> +/* Base function for offset calculation. Needs to go into .text section,
> + * therefore keeping it non-static as well; will also be used by JITs
> + * anyway later on, so do not let the compiler omit it.
> + */
> +noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
> +{
> + return 0;
> +}
> +
> /**
> - * sk_run_filter - run a filter on a socket
> - * @skb: buffer to run the filter on
> + * __sk_run_filter - run a filter on a given context
> + * @ctx: buffer to run the filter on
> * @fentry: filter to apply
> *
> - * Decode and apply filter instructions to the skb->data.
> - * Return length to keep, 0 for none. @skb is the data we are
> - * filtering, @filter is the array of filter instructions.
> - * Because all jumps are guaranteed to be before last instruction,
> - * and last instruction guaranteed to be a RET, we dont need to check
> - * flen. (We used to pass to this function the length of filter)
> + * Decode and apply filter instructions to the skb->data. Return length to
> + * keep, 0 for none. @ctx is the data we are operating on, @filter is the
> + * array of filter instructions.
> */
> -unsigned int sk_run_filter(const struct sk_buff *skb,
> - const struct sock_filter *fentry)
> +unsigned int __sk_run_filter(void *ctx, const struct sock_filter_int *insn)
> {
> + u64 stack[MAX_BPF_STACK / sizeof(u64)];
> + u64 regs[MAX_BPF_REG], tmp;
> void *ptr;
> - u32 A = 0; /* Accumulator */
> - u32 X = 0; /* Index Register */
> - u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */
> - u32 tmp;
> - int k;
> + int off;
> +
> +#define K insn->imm
> +#define A regs[insn->a_reg]
> +#define X regs[insn->x_reg]
> +
> +#define CONT ({insn++; goto select_insn; })
> +#define CONT_JMP ({insn++; goto select_insn; })
> +
> + static const void *jumptable[256] = {
> + [0 ... 255] = &&default_label,
> + /* Overwrite non-defaults ... */
> +#define DL(A, B, C) [A|B|C] = &&A##_##B##_##C
> + DL(BPF_ALU, BPF_ADD, BPF_X),
> + DL(BPF_ALU, BPF_ADD, BPF_K),
> + DL(BPF_ALU, BPF_SUB, BPF_X),
> + DL(BPF_ALU, BPF_SUB, BPF_K),
> + DL(BPF_ALU, BPF_AND, BPF_X),
> + DL(BPF_ALU, BPF_AND, BPF_K),
> + DL(BPF_ALU, BPF_OR, BPF_X),
> + DL(BPF_ALU, BPF_OR, BPF_K),
> + DL(BPF_ALU, BPF_LSH, BPF_X),
> + DL(BPF_ALU, BPF_LSH, BPF_K),
> + DL(BPF_ALU, BPF_RSH, BPF_X),
> + DL(BPF_ALU, BPF_RSH, BPF_K),
> + DL(BPF_ALU, BPF_XOR, BPF_X),
> + DL(BPF_ALU, BPF_XOR, BPF_K),
> + DL(BPF_ALU, BPF_MUL, BPF_X),
> + DL(BPF_ALU, BPF_MUL, BPF_K),
> + DL(BPF_ALU, BPF_MOV, BPF_X),
> + DL(BPF_ALU, BPF_MOV, BPF_K),
> + DL(BPF_ALU, BPF_DIV, BPF_X),
> + DL(BPF_ALU, BPF_DIV, BPF_K),
> + DL(BPF_ALU, BPF_MOD, BPF_X),
> + DL(BPF_ALU, BPF_MOD, BPF_K),
> + DL(BPF_ALU, BPF_BSWAP, BPF_X),
> + DL(BPF_ALU, BPF_NEG, 0),
> + DL(BPF_ALU64, BPF_ADD, BPF_X),
> + DL(BPF_ALU64, BPF_ADD, BPF_K),
> + DL(BPF_ALU64, BPF_SUB, BPF_X),
> + DL(BPF_ALU64, BPF_SUB, BPF_K),
> + DL(BPF_ALU64, BPF_AND, BPF_X),
> + DL(BPF_ALU64, BPF_AND, BPF_K),
> + DL(BPF_ALU64, BPF_OR, BPF_X),
> + DL(BPF_ALU64, BPF_OR, BPF_K),
> + DL(BPF_ALU64, BPF_LSH, BPF_X),
> + DL(BPF_ALU64, BPF_LSH, BPF_K),
> + DL(BPF_ALU64, BPF_RSH, BPF_X),
> + DL(BPF_ALU64, BPF_RSH, BPF_K),
> + DL(BPF_ALU64, BPF_XOR, BPF_X),
> + DL(BPF_ALU64, BPF_XOR, BPF_K),
> + DL(BPF_ALU64, BPF_MUL, BPF_X),
> + DL(BPF_ALU64, BPF_MUL, BPF_K),
> + DL(BPF_ALU64, BPF_MOV, BPF_X),
> + DL(BPF_ALU64, BPF_MOV, BPF_K),
> + DL(BPF_ALU64, BPF_ARSH, BPF_X),
> + DL(BPF_ALU64, BPF_ARSH, BPF_K),
> + DL(BPF_ALU64, BPF_DIV, BPF_X),
> + DL(BPF_ALU64, BPF_DIV, BPF_K),
> + DL(BPF_ALU64, BPF_MOD, BPF_X),
> + DL(BPF_ALU64, BPF_MOD, BPF_K),
> + DL(BPF_ALU64, BPF_BSWAP, BPF_X),
> + DL(BPF_ALU64, BPF_NEG, 0),
> + DL(BPF_JMP, BPF_CALL, 0),
> + DL(BPF_JMP, BPF_JA, 0),
> + DL(BPF_JMP, BPF_JEQ, BPF_X),
> + DL(BPF_JMP, BPF_JEQ, BPF_K),
> + DL(BPF_JMP, BPF_JNE, BPF_X),
> + DL(BPF_JMP, BPF_JNE, BPF_K),
> + DL(BPF_JMP, BPF_JGT, BPF_X),
> + DL(BPF_JMP, BPF_JGT, BPF_K),
> + DL(BPF_JMP, BPF_JGE, BPF_X),
> + DL(BPF_JMP, BPF_JGE, BPF_K),
> + DL(BPF_JMP, BPF_JSGT, BPF_X),
> + DL(BPF_JMP, BPF_JSGT, BPF_K),
> + DL(BPF_JMP, BPF_JSGE, BPF_X),
> + DL(BPF_JMP, BPF_JSGE, BPF_K),
> + DL(BPF_JMP, BPF_JSET, BPF_X),
> + DL(BPF_JMP, BPF_JSET, BPF_K),
> + DL(BPF_STX, BPF_MEM, BPF_B),
> + DL(BPF_STX, BPF_MEM, BPF_H),
> + DL(BPF_STX, BPF_MEM, BPF_W),
> + DL(BPF_STX, BPF_MEM, BPF_DW),
> + DL(BPF_ST, BPF_MEM, BPF_B),
> + DL(BPF_ST, BPF_MEM, BPF_H),
> + DL(BPF_ST, BPF_MEM, BPF_W),
> + DL(BPF_ST, BPF_MEM, BPF_DW),
> + DL(BPF_LDX, BPF_MEM, BPF_B),
> + DL(BPF_LDX, BPF_MEM, BPF_H),
> + DL(BPF_LDX, BPF_MEM, BPF_W),
> + DL(BPF_LDX, BPF_MEM, BPF_DW),
> + DL(BPF_STX, BPF_XADD, BPF_W),
> + DL(BPF_STX, BPF_XADD, BPF_DW),
> + DL(BPF_LD, BPF_ABS, BPF_W),
> + DL(BPF_LD, BPF_ABS, BPF_H),
> + DL(BPF_LD, BPF_ABS, BPF_B),
> + DL(BPF_LD, BPF_IND, BPF_W),
> + DL(BPF_LD, BPF_IND, BPF_H),
> + DL(BPF_LD, BPF_IND, BPF_B),
> + DL(BPF_RET, BPF_K, 0),
> +#undef DL
> + };
>
> - /*
> - * Process array of filter instructions.
> - */
> - for (;; fentry++) {
> -#if defined(CONFIG_X86_32)
> -#define K (fentry->k)
> -#else
> - const u32 K = fentry->k;
> -#endif
> -
> - switch (fentry->code) {
> - case BPF_S_ALU_ADD_X:
> - A += X;
> - continue;
> - case BPF_S_ALU_ADD_K:
> - A += K;
> - continue;
> - case BPF_S_ALU_SUB_X:
> - A -= X;
> - continue;
> - case BPF_S_ALU_SUB_K:
> - A -= K;
> - continue;
> - case BPF_S_ALU_MUL_X:
> - A *= X;
> - continue;
> - case BPF_S_ALU_MUL_K:
> - A *= K;
> - continue;
> - case BPF_S_ALU_DIV_X:
> - if (X == 0)
> - return 0;
> - A /= X;
> - continue;
> - case BPF_S_ALU_DIV_K:
> - A /= K;
> - continue;
> - case BPF_S_ALU_MOD_X:
> - if (X == 0)
> - return 0;
> - A %= X;
> - continue;
> - case BPF_S_ALU_MOD_K:
> - A %= K;
> - continue;
> - case BPF_S_ALU_AND_X:
> - A &= X;
> - continue;
> - case BPF_S_ALU_AND_K:
> - A &= K;
> - continue;
> - case BPF_S_ALU_OR_X:
> - A |= X;
> - continue;
> - case BPF_S_ALU_OR_K:
> - A |= K;
> - continue;
> - case BPF_S_ANC_ALU_XOR_X:
> - case BPF_S_ALU_XOR_X:
> - A ^= X;
> - continue;
> - case BPF_S_ALU_XOR_K:
> - A ^= K;
> - continue;
> - case BPF_S_ALU_LSH_X:
> - A <<= X;
> - continue;
> - case BPF_S_ALU_LSH_K:
> - A <<= K;
> - continue;
> - case BPF_S_ALU_RSH_X:
> - A >>= X;
> - continue;
> - case BPF_S_ALU_RSH_K:
> - A >>= K;
> - continue;
> - case BPF_S_ALU_NEG:
> - A = -A;
> - continue;
> - case BPF_S_JMP_JA:
> - fentry += K;
> - continue;
> - case BPF_S_JMP_JGT_K:
> - fentry += (A > K) ? fentry->jt : fentry->jf;
> - continue;
> - case BPF_S_JMP_JGE_K:
> - fentry += (A >= K) ? fentry->jt : fentry->jf;
> - continue;
> - case BPF_S_JMP_JEQ_K:
> - fentry += (A == K) ? fentry->jt : fentry->jf;
> - continue;
> - case BPF_S_JMP_JSET_K:
> - fentry += (A & K) ? fentry->jt : fentry->jf;
> - continue;
> - case BPF_S_JMP_JGT_X:
> - fentry += (A > X) ? fentry->jt : fentry->jf;
> - continue;
> - case BPF_S_JMP_JGE_X:
> - fentry += (A >= X) ? fentry->jt : fentry->jf;
> - continue;
> - case BPF_S_JMP_JEQ_X:
> - fentry += (A == X) ? fentry->jt : fentry->jf;
> - continue;
> - case BPF_S_JMP_JSET_X:
> - fentry += (A & X) ? fentry->jt : fentry->jf;
> - continue;
> - case BPF_S_LD_W_ABS:
> - k = K;
> -load_w:
> - ptr = load_pointer(skb, k, 4, &tmp);
> - if (ptr != NULL) {
> - A = get_unaligned_be32(ptr);
> - continue;
> - }
> - return 0;
> - case BPF_S_LD_H_ABS:
> - k = K;
> -load_h:
> - ptr = load_pointer(skb, k, 2, &tmp);
> - if (ptr != NULL) {
> - A = get_unaligned_be16(ptr);
> - continue;
> + regs[FP_REG] = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)];
> + regs[CTX_REG] = (u64) (unsigned long) ctx;
> +
> +select_insn:
> + goto *jumptable[insn->code];
> +
> + /* ALU */
> +#define ALU(OPCODE, OP) \
> + BPF_ALU64_##OPCODE##_BPF_X: \
> + A = A OP X; \
> + CONT; \
> + BPF_ALU_##OPCODE##_BPF_X: \
> + A = (u32) A OP (u32) X; \
> + CONT; \
> + BPF_ALU64_##OPCODE##_BPF_K: \
> + A = A OP K; \
> + CONT; \
> + BPF_ALU_##OPCODE##_BPF_K: \
> + A = (u32) A OP (u32) K; \
> + CONT;
> +
> + ALU(BPF_ADD, +)
> + ALU(BPF_SUB, -)
> + ALU(BPF_AND, &)
> + ALU(BPF_OR, |)
> + ALU(BPF_LSH, <<)
> + ALU(BPF_RSH, >>)
> + ALU(BPF_XOR, ^)
> + ALU(BPF_MUL, *)
> +#undef ALU
> + BPF_ALU_BPF_NEG_0:
> + A = (u32) -A;
> + CONT;
> + BPF_ALU64_BPF_NEG_0:
> + A = -A;
> + CONT;
> + BPF_ALU_BPF_MOV_BPF_X:
> + A = (u32) X;
> + CONT;
> + BPF_ALU_BPF_MOV_BPF_K:
> + A = (u32) K;
> + CONT;
> + BPF_ALU64_BPF_MOV_BPF_X:
> + A = X;
> + CONT;
> + BPF_ALU64_BPF_MOV_BPF_K:
> + A = K;
> + CONT;
> + BPF_ALU64_BPF_ARSH_BPF_X:
> + (*(s64 *) &A) >>= X;
> + CONT;
> + BPF_ALU64_BPF_ARSH_BPF_K:
> + (*(s64 *) &A) >>= K;
> + CONT;
> + BPF_ALU64_BPF_MOD_BPF_X:
> + tmp = A;
> + if (X)
> + A = do_div(tmp, X);
> + CONT;
> + BPF_ALU_BPF_MOD_BPF_X:
> + tmp = (u32) A;
> + if (X)
> + A = do_div(tmp, (u32) X);
> + CONT;
> + BPF_ALU64_BPF_MOD_BPF_K:
> + tmp = A;
> + if (K)
> + A = do_div(tmp, K);
> + CONT;
> + BPF_ALU_BPF_MOD_BPF_K:
> + tmp = (u32) A;
> + if (K)
> + A = do_div(tmp, (u32) K);
> + CONT;
> + BPF_ALU64_BPF_DIV_BPF_X:
> + if (X)
> + do_div(A, X);
> + CONT;
> + BPF_ALU_BPF_DIV_BPF_X:
> + tmp = (u32) A;
> + if (X)
> + do_div(tmp, (u32) X);
> + A = (u32) tmp;
> + CONT;
> + BPF_ALU64_BPF_DIV_BPF_K:
> + if (K)
> + do_div(A, K);
> + CONT;
> + BPF_ALU_BPF_DIV_BPF_K:
> + tmp = (u32) A;
> + if (K)
> + do_div(tmp, (u32) K);
> + A = (u32) tmp;
> + CONT;
> + BPF_ALU_BPF_BSWAP_BPF_X:
> + A = swab32(A);
> + CONT;
> + BPF_ALU64_BPF_BSWAP_BPF_X:
> + A = swab64(A);
> + CONT;
> +
> + /* CALL */
> + BPF_JMP_BPF_CALL_0:
> + regs[0] = (__bpf_call_base + insn->imm)(regs[1], regs[2],
> + regs[3], regs[4],
> + regs[5]);
> + CONT;
> +
> + /* JMP */
> + BPF_JMP_BPF_JA_0:
> + insn += insn->off;
> + CONT;
> + BPF_JMP_BPF_JEQ_BPF_X:
> + if (A == X) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JEQ_BPF_K:
> + if (A == K) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JNE_BPF_X:
> + if (A != X) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JNE_BPF_K:
> + if (A != K) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JGT_BPF_X:
> + if (A > X) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JGT_BPF_K:
> + if (A > K) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JGE_BPF_X:
> + if (A >= X) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JGE_BPF_K:
> + if (A >= K) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JSGT_BPF_X:
> + if (((s64)A) > ((s64)X)) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JSGT_BPF_K:
> + if (((s64)A) > ((s64)K)) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JSGE_BPF_X:
> + if (((s64)A) >= ((s64)X)) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JSGE_BPF_K:
> + if (((s64)A) >= ((s64)K)) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JSET_BPF_X:
> + if (A & X) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> + BPF_JMP_BPF_JSET_BPF_K:
> + if (A & K) {
> + insn += insn->off;
> + CONT_JMP;
> + }
> + CONT;
> +
> + /* STX, ST and LDX */
> +#define LDST(SIZEOP, SIZE) \
> + BPF_STX_BPF_MEM_##SIZEOP: \
> + *(SIZE *)(unsigned long) (A + insn->off) = X; \
> + CONT; \
> + BPF_ST_BPF_MEM_##SIZEOP: \
> + *(SIZE *)(unsigned long) (A + insn->off) = K; \
> + CONT; \
> + BPF_LDX_BPF_MEM_##SIZEOP: \
> + A = *(SIZE *)(unsigned long) (X + insn->off); \
> + CONT;
> +
> + LDST(BPF_B, u8)
> + LDST(BPF_H, u16)
> + LDST(BPF_W, u32)
> + LDST(BPF_DW, u64)
> +#undef LDST
> + BPF_STX_BPF_XADD_BPF_W: /* lock xadd *(u32 *)(A + insn->off) += X */
> + atomic_add((u32) X, (atomic_t *)(unsigned long)
> + (A + insn->off));
> + CONT;
> + BPF_STX_BPF_XADD_BPF_DW: /* lock xadd *(u64 *)(A + insn->off) += X */
> + atomic64_add((u64) X, (atomic64_t *)(unsigned long)
> + (A + insn->off));
> + CONT;
> + BPF_LD_BPF_ABS_BPF_W: /* A = *(u32 *)(ctx + K) */
> + off = K;
> +load_word:
> + /* BPF_LD + BPF_ABS and BPF_LD + BPF_IND insns only appear
> + * in programs where ctx == skb.
> + */
> + ptr = load_pointer((struct sk_buff *) ctx, off, 4, &tmp);
> + if (likely(ptr != NULL)) {
> + A = get_unaligned_be32(ptr);
> + CONT;
> + }
> + return 0;
> + BPF_LD_BPF_ABS_BPF_H: /* A = *(u16 *)(ctx + K) */
> + off = K;
> +load_half:
> + ptr = load_pointer((struct sk_buff *) ctx, off, 2, &tmp);
> + if (likely(ptr != NULL)) {
> + A = get_unaligned_be16(ptr);
> + CONT;
> + }
> + return 0;
> +
> + BPF_LD_BPF_ABS_BPF_B: /* A = *(u8 *)(ctx + K) */
> + off = K;
> +load_byte:
> + ptr = load_pointer((struct sk_buff *) ctx, off, 1, &tmp);
> + if (likely(ptr != NULL)) {
> + A = *(u8 *)ptr;
> + CONT;
> + }
> + return 0;
> + BPF_LD_BPF_IND_BPF_W: /* A = *(u32 *)(ctx + X + K) */
> + off = K + X;
> + goto load_word;
> + BPF_LD_BPF_IND_BPF_H: /* A = *(u16 *)(ctx + X + K) */
> + off = K + X;
> + goto load_half;
> + BPF_LD_BPF_IND_BPF_B: /* A = *(u8 *)(ctx + X + K) */
> + off = K + X;
> + goto load_byte;
> +
> + /* RET */
> + BPF_RET_BPF_K_0:
> + return regs[0 /* R0 */];
> +
> + default_label:
> + /* If we ever reach this, we have a bug somewhere. */
> + WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
> + return 0;
> +#undef CONT_JMP
> +#undef CONT
> +#undef A
> +#undef X
> +#undef K
> +}
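
For anyone following the dispatch structure above: the CONT/CONT_JMP macros are defined earlier in the patch and are not visible in this part of the quote, so what follows is only a minimal userspace sketch of the general computed-goto ("jump-threaded") technique, assuming the GCC/Clang labels-as-values extension. The opcode names, struct insn_sketch and run_sketch() are made up for illustration and are not the patch's definitions.

```c
#include <stdint.h>

/* Hypothetical mini instruction set, for illustration only. */
enum { OP_ADD_K, OP_JEQ_K, OP_JA, OP_RET };

struct insn_sketch {
	uint8_t code;
	int16_t off;	/* jump offset, relative to the next insn */
	int32_t imm;
};

static uint64_t run_sketch(const struct insn_sketch *insn)
{
	/* One label address per opcode (GNU C extension). */
	static const void *jumptable[] = {
		[OP_ADD_K] = &&do_add_k,
		[OP_JEQ_K] = &&do_jeq_k,
		[OP_JA]    = &&do_ja,
		[OP_RET]   = &&do_ret,
	};
	uint64_t a = 0;

	/* Every handler ends in its own indirect jump instead of
	 * returning to a single switch at the top of a loop.
	 */
#define CONT do { insn++; goto *jumptable[insn->code]; } while (0)

	goto *jumptable[insn->code];

do_add_k:
	a += (uint64_t)(int64_t)insn->imm;
	CONT;
do_jeq_k:
	/* Single branch target; the not-taken path falls through. */
	if (a == (uint64_t)(int64_t)insn->imm)
		insn += insn->off;
	CONT;
do_ja:
	insn += insn->off;
	CONT;
do_ret:
	return a;
#undef CONT
}
```

With a program of ADD_K 5; JEQ_K 5, +1; ADD_K 100; RET, the taken branch skips the second add. The offset is relative to the following instruction because CONT still increments insn after the jump, which matches the "- 1" in EMIT_JMP further down.
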
> +
> +u32 sk_run_filter_int_seccomp(const struct seccomp_data *ctx,
> + const struct sock_filter_int *insni)
> + __attribute__ ((alias ("__sk_run_filter")));
> +
> +u32 sk_run_filter_int_skb(const struct sk_buff *ctx,
> + const struct sock_filter_int *insni)
> + __attribute__ ((alias ("__sk_run_filter")));
> +EXPORT_SYMBOL_GPL(sk_run_filter_int_skb);
> +
> +/* Helper to find the offset of pkt_type in sk_buff structure. We want
> + * to make sure it's still a 3-bit field starting at a byte boundary;
> + * taken from arch/x86/net/bpf_jit_comp.c.
> + */
> +#define PKT_TYPE_MAX 7
> +static unsigned int pkt_type_offset(void)
> +{
> + struct sk_buff skb_probe = { .pkt_type = ~0, };
> + u8 *ct = (u8 *) &skb_probe;
> + unsigned int off;
> +
> + for (off = 0; off < sizeof(struct sk_buff); off++) {
> + if (ct[off] == PKT_TYPE_MAX)
> + return off;
> + }
> +
> + pr_err_once("Please fix %s, as pkt_type couldn't be found!\n", __func__);
> + return -1;
> +}
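
The probing trick in pkt_type_offset() may be easier to see outside the kernel. Below is a minimal sketch, assuming a made-up struct probe_sketch in place of struct sk_buff: zero the object, set every bit of the bitfield, then scan for the byte that changed. Bitfield layout is implementation-defined, so the sketch also accepts 0xe0 (the same three bits allocated from the most significant end); that is my assumption and not something the quoted helper does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff, for illustration only. */
struct probe_sketch {
	unsigned int len;
	unsigned short proto;
	unsigned char pkt_type:3;
	unsigned char other:5;
	unsigned int mark;
};

/* Returns the byte offset holding pkt_type, or -1 if not found. */
static int bitfield_byte_offset_sketch(void)
{
	struct probe_sketch p;
	const unsigned char *bytes = (const unsigned char *)&p;
	size_t off;

	memset(&p, 0, sizeof(p));
	p.pkt_type = 7;		/* all three bits set */

	for (off = 0; off < sizeof(p); off++) {
		/* 0x07: bits allocated from the LSB, 0xe0: from the MSB */
		if (bytes[off] == 0x07 || bytes[off] == 0xe0)
			return (int)off;
	}

	return -1;
}
```

On common ABIs (for example x86-64 with GCC or Clang) the bitfield byte directly follows the two-byte proto member, so the function is expected to return 6 there.
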
> +
> +static u64 __skb_get_pay_offset(u64 ctx, u64 A, u64 X, u64 r4, u64 r5)
> +{
> + struct sk_buff *skb = (struct sk_buff *)(long) ctx;
> +
> + return __skb_get_poff(skb);
> +}
> +
> +static u64 __skb_get_nlattr(u64 ctx, u64 A, u64 X, u64 r4, u64 r5)
> +{
> + struct sk_buff *skb = (struct sk_buff *)(long) ctx;
> + struct nlattr *nla;
> +
> + if (skb_is_nonlinear(skb))
> + return 0;
> +
> + if (A > skb->len - sizeof(struct nlattr))
> + return 0;
> +
> + nla = nla_find((struct nlattr *) &skb->data[A], skb->len - A, X);
> + if (nla)
> + return (void *) nla - (void *) skb->data;
> +
> + return 0;
> +}
> +
> +static u64 __skb_get_nlattr_nest(u64 ctx, u64 A, u64 X, u64 r4, u64 r5)
> +{
> + struct sk_buff *skb = (struct sk_buff *)(long) ctx;
> + struct nlattr *nla;
> +
> + if (skb_is_nonlinear(skb))
> + return 0;
> +
> + if (A > skb->len - sizeof(struct nlattr))
> + return 0;
> +
> + nla = (struct nlattr *) &skb->data[A];
> + if (nla->nla_len > A - skb->len)
> + return 0;
> +
> + nla = nla_find_nested(nla, X);
> + if (nla)
> + return (void *) nla - (void *) skb->data;
> +
> + return 0;
> +}
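
A side note on the nla_len check, which the diff shows being carried over unchanged from the removed BPF_S_ANC_NLATTR_NEST case: the preceding check only lets A through when it is below skb->len, so the unsigned subtraction "A - skb->len" wraps around to a very large value there. The sketch below only illustrates that arithmetic with made-up helper names and plain integers instead of an skb; the second helper compares against the remaining length, which is my reading of what such a check would normally look like, not something the patch states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Comparison in the same operand order as the quoted code. */
static bool rejected_as_written_sketch(uint64_t a, uint32_t skb_len,
				       uint16_t nla_len)
{
	return nla_len > a - skb_len;
}

/* Comparison against the bytes remaining after offset a. */
static bool rejected_vs_remaining_sketch(uint64_t a, uint32_t skb_len,
					 uint16_t nla_len)
{
	return nla_len > skb_len - a;
}
```

For a = 4, skb_len = 20 and nla_len = 100, the first form computes 4 - 20 in unsigned 64-bit arithmetic and does not reject the attribute, while the second form sees 16 remaining bytes and does.
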
> +
> +static u64 __get_raw_cpu_id(u64 ctx, u64 A, u64 X, u64 r4, u64 r5)
> +{
> + return raw_smp_processor_id();
> +}
> +
> +/* Register mappings for user programs. */
> +#define A_REG 6
> +#define X_REG 7
> +#define TMP_REG 8
> +
> +static bool convert_bpf_extensions(struct sock_filter *fp,
> + struct sock_filter_int **insnp)
> +{
> + struct sock_filter_int *insn = *insnp;
> +
> + switch (fp->k) {
> + case SKF_AD_OFF + SKF_AD_PROTOCOL:
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, protocol) != 2);
> +
> + insn->code = BPF_LDX | BPF_MEM | BPF_H;
> + insn->a_reg = A_REG;
> + insn->x_reg = CTX_REG;
> + insn->off = offsetof(struct sk_buff, protocol);
> +#ifdef __LITTLE_ENDIAN
> + insn++;
> +
> + /* A = swab32(A) */
> + insn->code = BPF_ALU | BPF_BSWAP | BPF_X;
> + insn->a_reg = A_REG;
> + insn++;
> +
> + /* A >>= 16 */
> + insn->code = BPF_ALU | BPF_RSH | BPF_K;
> + insn->a_reg = A_REG;
> + insn->imm = 16;
> +#endif /* __LITTLE_ENDIAN */
> + break;
> +
> + case SKF_AD_OFF + SKF_AD_PKTTYPE:
> + insn->code = BPF_LDX | BPF_MEM | BPF_B;
> + insn->a_reg = A_REG;
> + insn->x_reg = CTX_REG;
> + insn->off = pkt_type_offset();
> + if (insn->off < 0)
> + return false;
> + insn++;
> +
> + insn->code = BPF_ALU | BPF_AND | BPF_K;
> + insn->a_reg = A_REG;
> + insn->imm = PKT_TYPE_MAX;
> + break;
> +
> + case SKF_AD_OFF + SKF_AD_IFINDEX:
> + case SKF_AD_OFF + SKF_AD_HATYPE:
> + if (FIELD_SIZEOF(struct sk_buff, dev) == 8)
> + insn->code = BPF_LDX | BPF_MEM | BPF_DW;
> + else
> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
> + insn->a_reg = TMP_REG;
> + insn->x_reg = CTX_REG;
> + insn->off = offsetof(struct sk_buff, dev);
> + insn++;
> +
> + insn->code = BPF_JMP | BPF_JNE | BPF_K;
> + insn->a_reg = TMP_REG;
> + insn->imm = 0;
> + insn->off = 1;
> + insn++;
> +
> + insn->code = BPF_RET | BPF_K;
> + insn++;
> +
> + BUILD_BUG_ON(FIELD_SIZEOF(struct net_device, ifindex) != 4);
> + BUILD_BUG_ON(FIELD_SIZEOF(struct net_device, type) != 2);
> +
> + insn->a_reg = A_REG;
> + insn->x_reg = TMP_REG;
> +
> + if (fp->k == SKF_AD_OFF + SKF_AD_IFINDEX) {
> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
> + insn->off = offsetof(struct net_device, ifindex);
> + } else {
> + insn->code = BPF_LDX | BPF_MEM | BPF_H;
> + insn->off = offsetof(struct net_device, type);
> + }
> + break;
> +
> + case SKF_AD_OFF + SKF_AD_MARK:
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
> +
> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
> + insn->a_reg = A_REG;
> + insn->x_reg = CTX_REG;
> + insn->off = offsetof(struct sk_buff, mark);
> + break;
> +
> + case SKF_AD_OFF + SKF_AD_RXHASH:
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, rxhash) != 4);
> +
> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
> + insn->a_reg = A_REG;
> + insn->x_reg = CTX_REG;
> + insn->off = offsetof(struct sk_buff, rxhash);
> + break;
> +
> + case SKF_AD_OFF + SKF_AD_QUEUE:
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, queue_mapping) != 2);
> +
> + insn->code = BPF_LDX | BPF_MEM | BPF_H;
> + insn->a_reg = A_REG;
> + insn->x_reg = CTX_REG;
> + insn->off = offsetof(struct sk_buff, queue_mapping);
> + break;
> +
> + case SKF_AD_OFF + SKF_AD_VLAN_TAG:
> + case SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT:
> + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
> +
> + insn->code = BPF_LDX | BPF_MEM | BPF_H;
> + insn->a_reg = A_REG;
> + insn->x_reg = CTX_REG;
> + insn->off = offsetof(struct sk_buff, vlan_tci);
> + insn++;
> +
> + BUILD_BUG_ON(VLAN_TAG_PRESENT != 0x1000);
> +
> + if (fp->k == SKF_AD_OFF + SKF_AD_VLAN_TAG) {
> + insn->code = BPF_ALU | BPF_AND | BPF_K;
> + insn->a_reg = A_REG;
> + insn->imm = ~VLAN_TAG_PRESENT;
> + } else {
> + insn->code = BPF_ALU | BPF_RSH | BPF_K;
> + insn->a_reg = A_REG;
> + insn->imm = 12;
> + insn++;
> +
> + insn->code = BPF_ALU | BPF_AND | BPF_K;
> + insn->a_reg = A_REG;
> + insn->imm = 1;
> + }
> + break;
> +
> + case SKF_AD_OFF + SKF_AD_PAY_OFFSET:
> + case SKF_AD_OFF + SKF_AD_NLATTR:
> + case SKF_AD_OFF + SKF_AD_NLATTR_NEST:
> + case SKF_AD_OFF + SKF_AD_CPU:
> + /* Save ctx */
> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> + insn->a_reg = TMP_REG;
> + insn->x_reg = CTX_REG;
> + insn++;
> +
> + /* arg2 = A */
> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> + insn->a_reg = 2;
> + insn->x_reg = A_REG;
> + insn++;
> +
> + /* arg3 = X */
> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> + insn->a_reg = 3;
> + insn->x_reg = X_REG;
> + insn++;
> +
> + /* Emit call(ctx, arg2=A, arg3=X) */
> + insn->code = BPF_JMP | BPF_CALL;
> + /* Re: sparse ... Share your drugs? High on caffeine ... ;-) */
> + switch (fp->k) {
> + case SKF_AD_OFF + SKF_AD_PAY_OFFSET:
> + insn->imm = __skb_get_pay_offset - __bpf_call_base;
> + break;
> + case SKF_AD_OFF + SKF_AD_NLATTR:
> + insn->imm = __skb_get_nlattr - __bpf_call_base;
> + break;
> + case SKF_AD_OFF + SKF_AD_NLATTR_NEST:
> + insn->imm = __skb_get_nlattr_nest - __bpf_call_base;
> + break;
> + case SKF_AD_OFF + SKF_AD_CPU:
> + insn->imm = __get_raw_cpu_id - __bpf_call_base;
> + break;
> + }
> + insn++;
> +
> + /* Restore ctx */
> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> + insn->a_reg = CTX_REG;
> + insn->x_reg = TMP_REG;
> + insn++;
> +
> + /* Move ret value into A_REG */
> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> + insn->a_reg = A_REG;
> + insn->x_reg = 0;
> + break;
> +
> + case SKF_AD_OFF + SKF_AD_ALU_XOR_X:
> + insn->code = BPF_ALU | BPF_XOR | BPF_X;
> + insn->a_reg = A_REG;
> + insn->x_reg = X_REG;
> + break;
> +
> + default:
> + /* This is just a dummy call to avoid letting the compiler
> + * evict __bpf_call_base() as an optimization. Placed here
> + * where no-one bothers.
> + */
> + BUG_ON(__bpf_call_base(0, 0, 0, 0, 0) != 0);
> + return false;
> + }
> +
> + *insnp = insn;
> + return true;
> +}
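
On the SKF_AD_PROTOCOL case: the little-endian path emits a 16-bit load followed by swab32 and a right shift by 16, which together behave like ntohs() on the loaded value. A small sketch of that identity, with helper names of my own choosing and a hand-written swab32 so it builds in userspace:

```c
#include <stdint.h>

/* Plain 32-bit byte swap, standing in for the kernel's swab32(). */
static uint32_t swab32_sketch(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* raw: a big-endian 16-bit field as read by a little-endian u16 load.
 * The swap moves the two bytes to the top half in swapped order and
 * the shift brings them back down, giving the host-order value.
 */
static uint32_t be16_to_host_le_sketch(uint16_t raw)
{
	return swab32_sketch((uint32_t)raw) >> 16;
}
```

For example, the bytes 08 00 on the wire load as 0x0008 on a little-endian machine and come out of the two steps as 0x0800.
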
> +
> +/**
> + * sk_convert_filter - convert filter program
> + * @prog: the user passed filter program
> + * @len: the length of the user passed filter program
> + * @new_prog: buffer where converted program will be stored
> + * @new_len: pointer to store length of converted program
> + *
> + * Remap 'sock_filter' style BPF instruction set to 'sock_filter_int' style.
> + * Conversion workflow:
> + *
> + * 1) First pass for calculating the new program length:
> + * sk_convert_filter(old_prog, old_len, NULL, &new_len)
> + *
> + * 2) Second call does the remapping, internally in two passes: the
> + * 1st pass finds the new jump offsets, the 2nd pass remaps:
> + * new_prog = kmalloc(sizeof(struct sock_filter_int) * new_len);
> + * sk_convert_filter(old_prog, old_len, new_prog, &new_len);
> + *
> + * User BPF's register A is mapped to our BPF register 6, user BPF
> + * register X is mapped to BPF register 7; frame pointer is always
> + * register 10; Context 'void *ctx' is stored in register 1, that is,
> + * for socket filters: ctx == 'struct sk_buff *', for seccomp:
> + * ctx == 'struct seccomp_data *'.
> + */
> +int sk_convert_filter(struct sock_filter *prog, int len,
> + struct sock_filter_int *new_prog, int *new_len)
> +{
> + int new_flen = 0, pass = 0, target, i;
> + struct sock_filter_int *new_insn;
> + struct sock_filter *fp;
> + int *addrs = NULL;
> + u8 bpf_src;
> +
> + BUILD_BUG_ON(BPF_MEMWORDS * sizeof(u32) > MAX_BPF_STACK);
> + BUILD_BUG_ON(FP_REG + 1 != MAX_BPF_REG);
> +
> + if (len <= 0 || len >= BPF_MAXINSNS)
> + return -EINVAL;
> +
> + if (new_prog) {
> + addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
> + if (!addrs)
> + return -ENOMEM;
> + }
> +
> +do_pass:
> + new_insn = new_prog;
> + fp = prog;
> +
> + for (i = 0; i < len; fp++, i++) {
> + struct sock_filter_int tmp_insns[6] = { };
> + struct sock_filter_int *insn = tmp_insns;
> +
> + if (addrs)
> + addrs[i] = new_insn - new_prog;
> +
> + switch (fp->code) {
> + /* All arithmetic insns and skb loads map as-is. */
> + case BPF_ALU | BPF_ADD | BPF_X:
> + case BPF_ALU | BPF_ADD | BPF_K:
> + case BPF_ALU | BPF_SUB | BPF_X:
> + case BPF_ALU | BPF_SUB | BPF_K:
> + case BPF_ALU | BPF_AND | BPF_X:
> + case BPF_ALU | BPF_AND | BPF_K:
> + case BPF_ALU | BPF_OR | BPF_X:
> + case BPF_ALU | BPF_OR | BPF_K:
> + case BPF_ALU | BPF_LSH | BPF_X:
> + case BPF_ALU | BPF_LSH | BPF_K:
> + case BPF_ALU | BPF_RSH | BPF_X:
> + case BPF_ALU | BPF_RSH | BPF_K:
> + case BPF_ALU | BPF_XOR | BPF_X:
> + case BPF_ALU | BPF_XOR | BPF_K:
> + case BPF_ALU | BPF_MUL | BPF_X:
> + case BPF_ALU | BPF_MUL | BPF_K:
> + case BPF_ALU | BPF_DIV | BPF_X:
> + case BPF_ALU | BPF_DIV | BPF_K:
> + case BPF_ALU | BPF_MOD | BPF_X:
> + case BPF_ALU | BPF_MOD | BPF_K:
> + case BPF_ALU | BPF_NEG:
> + case BPF_LD | BPF_ABS | BPF_W:
> + case BPF_LD | BPF_ABS | BPF_H:
> + case BPF_LD | BPF_ABS | BPF_B:
> + case BPF_LD | BPF_IND | BPF_W:
> + case BPF_LD | BPF_IND | BPF_H:
> + case BPF_LD | BPF_IND | BPF_B:
> + /* Check for overloaded BPF extension and
> + * directly convert it if found, otherwise
> + * just move on with mapping.
> + */
> + if (BPF_CLASS(fp->code) == BPF_LD &&
> + BPF_MODE(fp->code) == BPF_ABS &&
> + convert_bpf_extensions(fp, &insn))
> + break;
> +
> + insn->code = fp->code;
> + insn->a_reg = A_REG;
> + insn->x_reg = X_REG;
> + insn->imm = fp->k;
> + break;
> +
> + /* Jump opcodes map as-is, but offsets need adjustment. */
> + case BPF_JMP | BPF_JA:
> + target = i + fp->k + 1;
> + insn->code = fp->code;
> +#define EMIT_JMP \
> + do { \
> + if (target >= len || target < 0) \
> + goto err; \
> + insn->off = addrs ? addrs[target] - addrs[i] - 1 : 0; \
> + /* Adjust pc relative offset for 2nd or 3rd insn. */ \
> + insn->off -= insn - tmp_insns; \
> + } while (0)
> +
> + EMIT_JMP;
> + break;
> +
> + case BPF_JMP | BPF_JEQ | BPF_K:
> + case BPF_JMP | BPF_JEQ | BPF_X:
> + case BPF_JMP | BPF_JSET | BPF_K:
> + case BPF_JMP | BPF_JSET | BPF_X:
> + case BPF_JMP | BPF_JGT | BPF_K:
> + case BPF_JMP | BPF_JGT | BPF_X:
> + case BPF_JMP | BPF_JGE | BPF_K:
> + case BPF_JMP | BPF_JGE | BPF_X:
> + if (BPF_SRC(fp->code) == BPF_K && (int) fp->k < 0) {
> + /* BPF immediates are signed, zero extend
> + * immediate into tmp register and use it
> + * in compare insn.
> + */
> + insn->code = BPF_ALU | BPF_MOV | BPF_K;
> + insn->a_reg = TMP_REG;
> + insn->imm = fp->k;
> + insn++;
> +
> + insn->a_reg = A_REG;
> + insn->x_reg = TMP_REG;
> + bpf_src = BPF_X;
> + } else {
> + insn->a_reg = A_REG;
> + insn->x_reg = X_REG;
> + insn->imm = fp->k;
> + bpf_src = BPF_SRC(fp->code);
> }
> - return 0;
> - case BPF_S_LD_B_ABS:
> - k = K;
> -load_b:
> - ptr = load_pointer(skb, k, 1, &tmp);
> - if (ptr != NULL) {
> - A = *(u8 *)ptr;
> - continue;
> +
> + /* Common case where 'jump_false' is next insn. */
> + if (fp->jf == 0) {
> + insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
> + target = i + fp->jt + 1;
> + EMIT_JMP;
> + break;
> }
> - return 0;
> - case BPF_S_LD_W_LEN:
> - A = skb->len;
> - continue;
> - case BPF_S_LDX_W_LEN:
> - X = skb->len;
> - continue;
> - case BPF_S_LD_W_IND:
> - k = X + K;
> - goto load_w;
> - case BPF_S_LD_H_IND:
> - k = X + K;
> - goto load_h;
> - case BPF_S_LD_B_IND:
> - k = X + K;
> - goto load_b;
> - case BPF_S_LDX_B_MSH:
> - ptr = load_pointer(skb, K, 1, &tmp);
> - if (ptr != NULL) {
> - X = (*(u8 *)ptr & 0xf) << 2;
> - continue;
> +
> + /* Convert JEQ into JNE when 'jump_true' is next insn. */
> + if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
> + insn->code = BPF_JMP | BPF_JNE | bpf_src;
> + target = i + fp->jf + 1;
> + EMIT_JMP;
> + break;
> }
> - return 0;
> - case BPF_S_LD_IMM:
> - A = K;
> - continue;
> - case BPF_S_LDX_IMM:
> - X = K;
> - continue;
> - case BPF_S_LD_MEM:
> - A = mem[K];
> - continue;
> - case BPF_S_LDX_MEM:
> - X = mem[K];
> - continue;
> - case BPF_S_MISC_TAX:
> - X = A;
> - continue;
> - case BPF_S_MISC_TXA:
> - A = X;
> - continue;
> - case BPF_S_RET_K:
> - return K;
> - case BPF_S_RET_A:
> - return A;
> - case BPF_S_ST:
> - mem[K] = A;
> - continue;
> - case BPF_S_STX:
> - mem[K] = X;
> - continue;
> - case BPF_S_ANC_PROTOCOL:
> - A = ntohs(skb->protocol);
> - continue;
> - case BPF_S_ANC_PKTTYPE:
> - A = skb->pkt_type;
> - continue;
> - case BPF_S_ANC_IFINDEX:
> - if (!skb->dev)
> - return 0;
> - A = skb->dev->ifindex;
> - continue;
> - case BPF_S_ANC_MARK:
> - A = skb->mark;
> - continue;
> - case BPF_S_ANC_QUEUE:
> - A = skb->queue_mapping;
> - continue;
> - case BPF_S_ANC_HATYPE:
> - if (!skb->dev)
> - return 0;
> - A = skb->dev->type;
> - continue;
> - case BPF_S_ANC_RXHASH:
> - A = skb->rxhash;
> - continue;
> - case BPF_S_ANC_CPU:
> - A = raw_smp_processor_id();
> - continue;
> - case BPF_S_ANC_VLAN_TAG:
> - A = vlan_tx_tag_get(skb);
> - continue;
> - case BPF_S_ANC_VLAN_TAG_PRESENT:
> - A = !!vlan_tx_tag_present(skb);
> - continue;
> - case BPF_S_ANC_PAY_OFFSET:
> - A = __skb_get_poff(skb);
> - continue;
> - case BPF_S_ANC_NLATTR: {
> - struct nlattr *nla;
> -
> - if (skb_is_nonlinear(skb))
> - return 0;
> - if (A > skb->len - sizeof(struct nlattr))
> - return 0;
> -
> - nla = nla_find((struct nlattr *)&skb->data[A],
> - skb->len - A, X);
> - if (nla)
> - A = (void *)nla - (void *)skb->data;
> - else
> - A = 0;
> - continue;
> - }
> - case BPF_S_ANC_NLATTR_NEST: {
> - struct nlattr *nla;
> -
> - if (skb_is_nonlinear(skb))
> - return 0;
> - if (A > skb->len - sizeof(struct nlattr))
> - return 0;
> -
> - nla = (struct nlattr *)&skb->data[A];
> - if (nla->nla_len > A - skb->len)
> - return 0;
> -
> - nla = nla_find_nested(nla, X);
> - if (nla)
> - A = (void *)nla - (void *)skb->data;
> - else
> - A = 0;
> - continue;
> - }
> -#ifdef CONFIG_SECCOMP_FILTER
> - case BPF_S_ANC_SECCOMP_LD_W:
> - A = seccomp_bpf_load(fentry->k);
> - continue;
> -#endif
> +
> + /* Other jumps are mapped into two insns: Jxx and JA. */
> + target = i + fp->jt + 1;
> + insn->code = BPF_JMP | BPF_OP(fp->code) | bpf_src;
> + EMIT_JMP;
> + insn++;
> +
> + insn->code = BPF_JMP | BPF_JA;
> + target = i + fp->jf + 1;
> + EMIT_JMP;
> + break;
> +
> + /* ldxb 4 * ([14] & 0xf) is remapped into 3 insns. */
> + case BPF_LDX | BPF_MSH | BPF_B:
> + insn->code = BPF_LD | BPF_ABS | BPF_B;
> + insn->a_reg = X_REG;
> + insn->imm = fp->k;
> + insn++;
> +
> + insn->code = BPF_ALU | BPF_AND | BPF_K;
> + insn->a_reg = X_REG;
> + insn->imm = 0xf;
> + insn++;
> +
> + insn->code = BPF_ALU | BPF_LSH | BPF_K;
> + insn->a_reg = X_REG;
> + insn->imm = 2;
> + break;
> +
> + /* RET_K, RET_A are remapped into 2 insns. */
> + case BPF_RET | BPF_A:
> + case BPF_RET | BPF_K:
> + insn->code = BPF_ALU | BPF_MOV |
> + (BPF_RVAL(fp->code) == BPF_K ?
> + BPF_K : BPF_X);
> + insn->a_reg = 0;
> + insn->x_reg = A_REG;
> + insn->imm = fp->k;
> + insn++;
> +
> + insn->code = BPF_RET | BPF_K;
> + break;
> +
> + /* Store to stack. */
> + case BPF_ST:
> + case BPF_STX:
> + insn->code = BPF_STX | BPF_MEM | BPF_W;
> + insn->a_reg = FP_REG;
> + insn->x_reg = fp->code == BPF_ST ? A_REG : X_REG;
> + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
> + break;
> +
> + /* Load from stack. */
> + case BPF_LD | BPF_MEM:
> + case BPF_LDX | BPF_MEM:
> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ?
> + A_REG : X_REG;
> + insn->x_reg = FP_REG;
> + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
> + break;
> +
> + /* A = K or X = K */
> + case BPF_LD | BPF_IMM:
> + case BPF_LDX | BPF_IMM:
> + insn->code = BPF_ALU | BPF_MOV | BPF_K;
> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ?
> + A_REG : X_REG;
> + insn->imm = fp->k;
> + break;
> +
> + /* X = A */
> + case BPF_MISC | BPF_TAX:
> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> + insn->a_reg = X_REG;
> + insn->x_reg = A_REG;
> + break;
> +
> + /* A = X */
> + case BPF_MISC | BPF_TXA:
> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
> + insn->a_reg = A_REG;
> + insn->x_reg = X_REG;
> + break;
> +
> + /* A = skb->len or X = skb->len */
> + case BPF_LD | BPF_W | BPF_LEN:
> + case BPF_LDX | BPF_W | BPF_LEN:
> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ?
> + A_REG : X_REG;
> + insn->x_reg = CTX_REG;
> + insn->off = offsetof(struct sk_buff, len);
> + break;
> +
> + /* access seccomp_data fields */
> + case BPF_LDX | BPF_ABS | BPF_W:
> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
> + insn->a_reg = A_REG;
> + insn->x_reg = CTX_REG;
> + insn->off = fp->k;
> + break;
> +
> default:
> - WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
> - fentry->code, fentry->jt,
> - fentry->jf, fentry->k);
> - return 0;
> + goto err;
> }
> +
> + insn++;
> + if (new_prog)
> + memcpy(new_insn, tmp_insns,
> + sizeof(*insn) * (insn - tmp_insns));
> +
> + new_insn += insn - tmp_insns;
> }
>
> + if (!new_prog) {
> + /* Only calculating new length. */
> + *new_len = new_insn - new_prog;
> + return 0;
> + }
> +
> + pass++;
> + if (new_flen != new_insn - new_prog) {
> + new_flen = new_insn - new_prog;
> + if (pass > 2)
> + goto err;
> +
> + goto do_pass;
> + }
> +
> + kfree(addrs);
> + BUG_ON(*new_len != new_flen);
> return 0;
> +err:
> + kfree(addrs);
> + return -EINVAL;
> }
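
To summarize the conditional-jump conversion in sk_convert_filter(): a classic jt/jf pair becomes either one jump with fall-through, or a conditional jump followed by an unconditional JA. The sketch below restates just that decision with hypothetical types (enum jop_sketch, struct jmp_sketch); targets are kept as indices into the old program, whereas the patch additionally translates them through addrs[] into relative offsets.

```c
/* Hypothetical opcode names, for illustration only. */
enum jop_sketch { JOP_JEQ, JOP_JGT, JOP_JGE, JOP_JSET, JOP_JNE, JOP_JA };

struct jmp_sketch {
	enum jop_sketch op;
	int target;	/* index of the target insn in the old program */
};

/* Converts the jump at old index i; returns the number of insns emitted. */
static int convert_cond_jump_sketch(int i, enum jop_sketch op,
				    unsigned char jt, unsigned char jf,
				    struct jmp_sketch out[2])
{
	/* 'jump_false' is the next insn: one jump, else fall through. */
	if (jf == 0) {
		out[0].op = op;
		out[0].target = i + jt + 1;
		return 1;
	}

	/* 'jump_true' is the next insn: invert JEQ into JNE. */
	if (jt == 0 && op == JOP_JEQ) {
		out[0].op = JOP_JNE;
		out[0].target = i + jf + 1;
		return 1;
	}

	/* General case: conditional jump plus an unconditional JA. */
	out[0].op = op;
	out[0].target = i + jt + 1;
	out[1].op = JOP_JA;
	out[1].target = i + jf + 1;
	return 2;
}
```

Only JEQ is inverted in the quoted code, so for instance a JGT with jt == 0 and a non-zero jf takes the two-instruction path.
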
> -EXPORT_SYMBOL(sk_run_filter);
>
> -/*
> - * Security :
> +/* Security:
> + *
> * A BPF program is able to use 16 cells of memory to store intermediate
> - * values (check u32 mem[BPF_MEMWORDS] in sk_run_filter())
> + * values (check u32 mem[BPF_MEMWORDS] in sk_run_filter()).
> + *
> * As we dont want to clear mem[] array for each packet going through
> * sk_run_filter(), we check that filter loaded by user never try to read
> * a cell if not previously written, and we check all branches to be sure
> @@ -696,19 +1400,130 @@ void sk_filter_charge(struct sock *sk, struct sk_filter *fp)
> atomic_add(sk_filter_size(fp->len), &sk->sk_omem_alloc);
> }
>
> -static int __sk_prepare_filter(struct sk_filter *fp)
> +static struct sk_filter *__sk_migrate_realloc(struct sk_filter *fp,
> + struct sock *sk,
> + unsigned int len)
> +{
> + struct sk_filter *fp_new;
> +
> + if (sk == NULL)
> + return krealloc(fp, len, GFP_KERNEL);
> +
> + fp_new = sock_kmalloc(sk, len, GFP_KERNEL);
> + if (fp_new) {
> + memcpy(fp_new, fp, sizeof(struct sk_filter));
> + /* As we're keeping orig_prog along in fp_new,
> + * we need to make sure we're not evicting it
> + * from the old fp.
> + */
> + fp->orig_prog = NULL;
> + sk_filter_uncharge(sk, fp);
> + }
> +
> + return fp_new;
> +}
> +
> +static struct sk_filter *__sk_migrate_filter(struct sk_filter *fp,
> + struct sock *sk)
> +{
> + struct sock_filter *old_prog;
> + struct sk_filter *old_fp;
> + int i, err, new_len, old_len = fp->len;
> +
> + /* We are free to overwrite insns et al right here as it
> + * won't be used at this point in time anymore internally
> + * after the migration to the internal BPF instruction
> + * representation.
> + */
> + BUILD_BUG_ON(sizeof(struct sock_filter) !=
> + sizeof(struct sock_filter_int));
> +
> + /* For now, we need to unfiddle BPF_S_* identifiers in place.
> + * This can be removed sooner or later, e.g. once the JITs have
> + * been converted.
> + */
> + for (i = 0; i < fp->len; i++)
> + sk_decode_filter(&fp->insns[i], &fp->insns[i]);
> +
> + /* Conversion cannot happen on overlapping memory areas,
> + * so we need to keep the user BPF around until the 2nd
> + * pass. At this time, the user BPF is stored in fp->insns.
> + */
> + old_prog = kmemdup(fp->insns, old_len * sizeof(struct sock_filter),
> + GFP_KERNEL);
> + if (!old_prog) {
> + err = -ENOMEM;
> + goto out_err;
> + }
> +
> + /* 1st pass: calculate the new program length. */
> + err = sk_convert_filter(old_prog, old_len, NULL, &new_len);
> + if (err)
> + goto out_err_free;
> +
> + /* Expand fp for appending the new filter representation. */
> + old_fp = fp;
> + fp = __sk_migrate_realloc(old_fp, sk, sk_filter_size(new_len));
> + if (!fp) {
> + /* The old_fp is still around in case we couldn't
> + * allocate new memory, so uncharge on that one.
> + */
> + fp = old_fp;
> + err = -ENOMEM;
> + goto out_err_free;
> + }
> +
> + fp->bpf_func = sk_run_filter_int_skb;
> + fp->len = new_len;
> +
> + /* 2nd pass: remap sock_filter insns into sock_filter_int insns. */
> + err = sk_convert_filter(old_prog, old_len, fp->insnsi, &new_len);
> + if (err)
> + /* 2nd sk_convert_filter() can fail only if it fails
> + * to allocate memory, remapping must succeed. Note,
> + * that at this time old_fp has already been released
> + * by __sk_migrate_realloc().
> + */
> + goto out_err_free;
> +
> + kfree(old_prog);
> + return fp;
> +
> +out_err_free:
> + kfree(old_prog);
> +out_err:
> + /* Rollback filter setup. */
> + if (sk != NULL)
> + sk_filter_uncharge(sk, fp);
> + else
> + kfree(fp);
> + return ERR_PTR(err);
> +}
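
The calling convention used by __sk_migrate_filter() is the familiar "size first, then fill" pattern: call the converter with a NULL output buffer to learn the length, allocate, and call it again to write the result. A minimal userspace sketch of that pattern follows, with a toy expand_sketch() standing in for sk_convert_filter(); the rule that negative inputs expand to two outputs is invented purely so that the output length differs from the input length.

```c
#include <stdlib.h>

/* Toy converter: with out == NULL it only computes the output length. */
static int expand_sketch(const int *in, int len, int *out, int *out_len)
{
	int n = 0, i;

	if (len <= 0)
		return -1;

	for (i = 0; i < len; i++) {
		/* Invented rule: negative inputs take two output slots. */
		int copies = in[i] < 0 ? 2 : 1;
		int c;

		for (c = 0; c < copies; c++) {
			if (out)
				out[n] = in[i];
			n++;
		}
	}

	*out_len = n;
	return 0;
}

/* Caller side: the 1st call sizes the buffer, the 2nd call fills it. */
static int *expand_alloc_sketch(const int *in, int len, int *out_len)
{
	int *out;

	if (expand_sketch(in, len, NULL, out_len))
		return NULL;

	out = malloc(sizeof(*out) * (size_t)*out_len);
	if (!out)
		return NULL;

	if (expand_sketch(in, len, out, out_len)) {
		free(out);
		return NULL;
	}

	return out;
}
```

Unlike this toy, the quoted sk_convert_filter() does not rewrite *new_len on the second call; it checks via BUG_ON() that the length from the first call still matches.
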
> +
> +static struct sk_filter *__sk_prepare_filter(struct sk_filter *fp,
> + struct sock *sk)
> {
> int err;
>
> - fp->bpf_func = sk_run_filter;
> + fp->bpf_func = NULL;
> fp->jited = 0;
>
> err = sk_chk_filter(fp->insns, fp->len);
> if (err)
> - return err;
> + return ERR_PTR(err);
>
> + /* Probe if we can JIT compile the filter and if so, do
> + * the compilation of the filter.
> + */
> bpf_jit_compile(fp);
> - return 0;
> +
> + /* JIT compiler couldn't process this filter, so do the
> + * internal BPF translation for the optimized interpreter.
> + */
> + if (!fp->jited)
> + fp = __sk_migrate_filter(fp, sk);
> +
> + return fp;
> }
>
> /**
> @@ -726,7 +1541,6 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
> {
> unsigned int fsize = sk_filter_proglen(fprog);
> struct sk_filter *fp;
> - int err;
>
> /* Make sure new filter is there and in the right amounts. */
> if (fprog->filter == NULL)
> @@ -746,15 +1560,15 @@ int sk_unattached_filter_create(struct sk_filter **pfp,
> */
> fp->orig_prog = NULL;
>
> - err = __sk_prepare_filter(fp);
> - if (err)
> - goto free_mem;
> + /* __sk_prepare_filter() already takes care of uncharging
> + * memory in case something goes wrong.
> + */
> + fp = __sk_prepare_filter(fp, NULL);
> + if (IS_ERR(fp))
> + return PTR_ERR(fp);
>
> *pfp = fp;
> return 0;
> -free_mem:
> - kfree(fp);
> - return err;
> }
> EXPORT_SYMBOL_GPL(sk_unattached_filter_create);
>
> @@ -806,11 +1620,12 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
> return -ENOMEM;
> }
>
> - err = __sk_prepare_filter(fp);
> - if (err) {
> - sk_filter_uncharge(sk, fp);
> - return err;
> - }
> + /* __sk_prepare_filter() already takes care of uncharging
> + * memory in case something goes wrong.
> + */
> + fp = __sk_prepare_filter(fp, sk);
> + if (IS_ERR(fp))
> + return PTR_ERR(fp);
>
> old_fp = rcu_dereference_protected(sk->sk_filter,
> sock_owned_by_user(sk));
> --
> 1.7.11.7
>
--
Kees Cook
Chrome OS Security