The XDP redirect process is two-staged:
- bpf_prog_run_xdp() is invoked to run an eBPF program which inspects the
  packet and makes decisions. While doing that, the per-CPU variable
  bpf_redirect_info is used.
- Afterwards xdp_do_redirect() is invoked and accesses bpf_redirect_info
  and it may also access other per-CPU variables like xskmap_flush_list.
  At the very end of the NAPI callback, xdp_do_flush() is invoked which
  does not access bpf_redirect_info but will touch the individual per-CPU
  lists.

The per-CPU variables are only used in the NAPI callback, hence disabling
bottom halves is the only protection mechanism. Users from preemptible
context (like cpu_map_kthread_run()) explicitly disable bottom halves
for protection reasons.
Without the implicit locking of local_bh_disable() on PREEMPT_RT, this
data structure requires explicit locking.

PREEMPT_RT has forced-threaded interrupts enabled and every NAPI
callback runs in a thread. If each thread has its own data structure
then locking can be avoided.

Create a struct bpf_net_context which contains struct bpf_redirect_info.
Define the variable on the stack, use bpf_net_ctx_set() to save a
pointer to it, and bpf_net_ctx_clear() to remove it again.
The bpf_net_ctx_set() may nest. For instance, a function can be invoked
both from NET_RX_SOFTIRQ/net_rx_action, which uses bpf_net_ctx_set(),
and from NET_TX_SOFTIRQ, which does not. Therefore only the first
invocation updates the pointer.

Use bpf_net_ctx_get_ri() as a wrapper to retrieve the current struct
bpf_redirect_info. The returned data structure is zero-initialized on
first use to ensure nothing is leaked from the stack: bpf_net_ctx_set()
sets bpf_redirect_info::kern_flags to 0 to note that initialisation is
required, and the first invocation of bpf_net_ctx_get_ri() will memset()
the data structure and update bpf_redirect_info::kern_flags.
bpf_redirect_info::nh is excluded from the memset() because it is only
used if BPF_F_NEIGH is set, which also sets the nh member. The
kern_flags member is moved past nh to exclude it from the memset() as
well.

The pointer to bpf_net_context is saved in the task's task_struct.
Always using the bpf_net_context approach has the advantage that there
is almost no difference between PREEMPT_RT and non-PREEMPT_RT builds.

Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
Cc: Andrii Nakryiko <andrii@xxxxxxxxxx>
Cc: Eduard Zingerman <eddyz87@xxxxxxxxx>
Cc: Hao Luo <haoluo@xxxxxxxxxx>
Cc: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
Cc: John Fastabend <john.fastabend@xxxxxxxxx>
Cc: KP Singh <kpsingh@xxxxxxxxxx>
Cc: Martin KaFai Lau <martin.lau@xxxxxxxxx>
Cc: Song Liu <song@xxxxxxxxxx>
Cc: Stanislav Fomichev <sdf@xxxxxxxxxx>
Cc: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
Cc: Yonghong Song <yonghong.song@xxxxxxxxx>
Cc: bpf@xxxxxxxxxxxxxxx
Acked-by: Alexei Starovoitov <ast@xxxxxxxxxx>
Reviewed-by: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
---
include/linux/filter.h | 56 ++++++++++++++++++++++++++++++++++--------
include/linux/sched.h | 3 +++
kernel/bpf/cpumap.c | 3 +++
kernel/bpf/devmap.c | 9 ++++++-
kernel/fork.c | 1 +
net/bpf/test_run.c | 11 ++++++++-
net/core/dev.c | 26 +++++++++++++++++++-
net/core/filter.c | 44 +++++++++------------------------
net/core/lwt_bpf.c | 3 +++
9 files changed, 111 insertions(+), 45 deletions(-)