[PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata

From: Alexander Lobakin
Date: Tue Jun 28 2022 - 15:52:08 EST


This RFC is to give the whole picture. It will most likely be split
into several series, maybe even merge cycles. See the "table of
contents" below.

The series adds the ability to pass different frame
details/parameters used by most NICs and the kernel stack (in skbs),
not essential, but highly wanted, such as:

* checksum value, status (Rx) or command (Tx);
* hash value and type/level (Rx);
* queue number (Rx);
* timestamps;
* and so on.

As XDP structures used to represent frames are as small as possible
and must stay like that, it is done by using the already existing
concept of metadata, i.e. some space right before a frame where BPF
programs can put arbitrary data.

Now, a NIC driver, or even a SmartNIC itself, can put those params
there in a well-defined format. The format is fixed, but can be of
several different types represented by structures, whose definitions
are available to the kernel, BPF programs and the userland.
It is fixed due to it being almost a UAPI, and the exact format can
be determined by reading the last 10 bytes of metadata. They contain
a 2-byte magic ID to not confuse it with a non-compatible meta and
an 8-byte combined BTF ID + type ID: the ID of the BTF where this
structure is defined and the ID of that definition inside that BTF.
Users can obtain BTF IDs by structure types using helpers available
in the kernel, BPF (written by the CO-RE/verifier) and the userland
(libbpf -> kernel call) and then rely on those IDs when reading data
to check whether they support the format and what to do with it.
Why separate magic and ID? The idea is to make different formats
always contain the basic/"generic" structure embedded at the end.
This way we can still benefit in purely generic consumers (like
cpumap) while providing some "extra" data to those who support it.
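
To illustrate the trailer check described above, here is a minimal
userspace sketch. Everything in it is an assumption made for
illustration only: the ex_ names, the magic value 0xABCD, the
little-endian encoding and the order of the two fields inside the
last 10 bytes (combined ID first, magic last). The real definitions
live in the patches themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical value; the real magic is defined by the series itself. */
#define EX_META_MAGIC	0xABCDu
#define EX_TRAILER_LEN	10u	/* 8-byte combined ID + 2-byte magic */

/* Read little-endian integers byte by byte, independent of host order. */
static uint16_t ex_get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint64_t ex_get_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];

	return v;
}

/*
 * @meta/@meta_len describe the metadata area that ends right where the
 * frame starts. Assumed layout of its last 10 bytes:
 * [8-byte combined ID][2-byte magic].
 * Returns true and stores the combined ID if the magic matches.
 */
static bool ex_meta_read_id(const uint8_t *meta, size_t meta_len,
			    uint64_t *full_id)
{
	const uint8_t *trailer;

	assert(meta && full_id);

	if (meta_len < EX_TRAILER_LEN)
		return false;

	trailer = meta + meta_len - EX_TRAILER_LEN;
	if (ex_get_le16(trailer + 8) != EX_META_MAGIC)
		return false;

	*full_id = ex_get_le64(trailer);

	return true;
}
```

A consumer would first compare the returned ID against the IDs of the
formats it knows, and fall back to treating the metadata as opaque
when the magic or the ID does not match.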

The enablement of this feature is controlled on attaching/replacing
XDP program on an interface with two new parameters: that combined
BTF+type ID and metadata threshold.
The threshold specifies the minimum frame size starting from which a
driver (or NIC) should compose metadata. It is introduced instead of
a plain true/false flag because it's often not worth spending cycles
to fetch all that data for small frames: let's say, it can be even
faster to just calculate checksums for them on the CPU rather than
touch a non-coherent DMA zone. A simple XDP_DROP case loses 15 Mpps
on 64-byte frames with metadata enabled; the threshold can help
mitigate that.
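
The per-frame decision a driver has to make can be sketched roughly
as follows. The structure and function names are hypothetical, and
treating a zero ID as "metadata disabled" is an assumption of this
sketch, not something taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue copy of the two XDP attach parameters. */
struct ex_meta_cfg {
	uint64_t full_id;	/* combined BTF ID + type ID, 0 == meta off (assumed) */
	uint32_t threshold;	/* minimum frame size in bytes to build meta for */
};

/* Returns true if metadata should be composed for a frame of @frame_len. */
static bool ex_should_build_meta(const struct ex_meta_cfg *cfg,
				 uint32_t frame_len)
{
	assert(cfg != NULL);

	if (!cfg->full_id)
		return false;

	/* Frames below the threshold skip the extra descriptor reads. */
	return frame_len >= cfg->threshold;
}
```

With a threshold of 128, for example, 64-byte frames would take the
plain path while 128-byte and larger frames would get metadata.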

The RFC can be divided into 8 parts:

01-04: BTF ID hacking: here Larysa provides BPF programs with not
       only type ID, but the ID of the BTF as well by using the
       unused upper 32 bits.
05-10: this provides in-kernel mechanisms for taking the ID and
       threshold from the userspace and passing them to the drivers.
11-18: provides libbpf API to be able to specify those params from
       the userspace, plus a small selftest to verify that both
       the kernel and the userspace parts work.
19-29: here the actual structure is defined, then the in-kernel
       helpers, and finally comes the first consumer: the function
       used to convert &xdp_frame to &sk_buff will now try to parse
       metadata. The affected users are cpumap and veth.
30-36: here I try to benefit from the metadata in cpumap even more
       by switching it to GRO. Now that we have checksums from NIC
       available... but even with no meta it gives some fair
       improvements.
37-43: enabling building generic metadata on Generic/skb path. Since
       skbs already have all those fields, it's not a problem to do
       this in here, plus it allows benefiting from it on interfaces
       not supporting meta yet.
44-47: ice driver part, including enabling prog hot-swap;
48-52: adds a complex selftest to verify everything works. Can be
       used as a sample as well, showing how to work with metadata
       in BPF programs and how to configure it from the userspace.
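
The combined ID from parts 01-04 (BTF ID in the upper 32 bits, type
ID in the lower 32 bits) can be pictured with the small sketch below.
The helper names are hypothetical and are not the ones added by the
series.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
	      "combined ID must hold two 32-bit IDs");

/* Pack: BTF ID goes to the upper 32 bits, type ID to the lower ones. */
static inline uint64_t ex_full_id(uint32_t btf_id, uint32_t type_id)
{
	return ((uint64_t)btf_id << 32) | type_id;
}

/* Unpack the ID of the BTF the type is defined in. */
static inline uint32_t ex_full_id_btf(uint64_t full_id)
{
	return (uint32_t)(full_id >> 32);
}

/* Unpack the ID of the type inside that BTF. */
static inline uint32_t ex_full_id_type(uint64_t full_id)
{
	return (uint32_t)full_id;
}
```

Comparing the whole 64-bit value at once is enough to tell two
formats apart, even when two different BTFs happen to use the same
type ID.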

Please refer to the actual commit messages where some precise
implementation details might be explained.
Nearly 20 of 52 are various cleanups and prereqs, as usual.

Perf figures were taken on cpumap redirect from the ice interface
(driver-side XDP), redirecting the traffic within the same node.

Frame size /     64/42   128/20    256/8    512/4   1024/2   1532/1
thread num

meta off         30022    31350    21993    12144     6374     3610
meta on          33059    28502    21503    12146     6380     3610
GRO meta off     30020    31822    21970    12145     6384     3610
GRO meta on      34736    28848    21566    12144     6381     3610

Yes, redirect between the nodes plays awfully with the metadata
composed by the driver:

meta off         21449    18078    16897    11820     6383     3610
meta on          16956    19004    14337     8228     5683     2822
GRO meta off     22539    19129    16304    11659     6381     3592
GRO meta on      17047    20366    15435     8878     5600     2753

Questions still open:

* the actual generic structure: it must have all the fields used
  often and by the majority of NICs. It can always be expanded
  later on (note that the structure grows to the left), but the
  less often UAPI is modified, the better (less compat pain);
* ability to specify the exact fields to fill by the driver, e.g.
  flags bitmap passed from the userspace. In theory it can be more
  optimal to not spend cycles on data we don't need, but at the
  same time it increases the complexity of the whole concept (e.g.
  it will be more problematic to unify drivers' routines for
  collecting data from descriptors to metadata and to skbs);
* there was an idea to be able to specify from the userspace the
  desired cacheline offset, so that [the wanted fields of] metadata
  and the packet headers would lie in the same CL. Can't be
  implemented in Generic/skb XDP and ice has some troubles with it
  too;
* AF_XDP/XSk perf numbers and various other scenarios are missing
  in general: is the current implementation optimal for them?
* metadata threshold and everything else present in this
  implementation.
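
To make the "grows to the left" remark concrete, here is a minimal
sketch of an extended format that keeps the generic part at its very
end, so that a purely generic consumer can always find it right
before the frame. Both structures and all field names are
hypothetical and much smaller than the real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical generic part: always the last bytes before the frame. */
struct ex_meta_generic {
	uint32_t rx_hash;
	uint32_t rx_queue;
	uint64_t full_id;
};

/* Hypothetical extended format: new fields are added in front. */
struct ex_meta_vendor {
	uint64_t vendor_tstamp;
	struct ex_meta_generic generic;	/* must stay at the very end */
};

static_assert(offsetof(struct ex_meta_vendor, generic) +
	      sizeof(struct ex_meta_generic) == sizeof(struct ex_meta_vendor),
	      "generic part must end the extended structure");

/*
 * @frame points to the first byte of the frame, metadata of @meta_len
 * bytes sits right before it. Copies out the generic part regardless
 * of how many extra bytes precede it.
 */
static bool ex_meta_get_generic(const uint8_t *frame, size_t meta_len,
				struct ex_meta_generic *out)
{
	if (meta_len < sizeof(*out))
		return false;

	memcpy(out, frame - sizeof(*out), sizeof(*out));

	return true;
}
```

A consumer that knows only ex_meta_generic keeps working when a
driver starts emitting ex_meta_vendor, since the offset from the
frame start to the generic part does not change.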

The RFC is also available on my open GitHub[0].

Merry and long review and discussion, enjoy!

[0] https://github.com/alobakin/linux/tree/xdp_hints

Alexander Lobakin (46):
libbpf: add function to get the pair BTF ID + type ID for a given type
net, xdp: decouple XDP code from the core networking code
bpf: pass a pointer to union bpf_attr to bpf_link_ops::update_prog()
net, xdp: remove redundant arguments from dev_xdp_{at,de}tach_link()
net, xdp: factor out XDP install arguments to a separate structure
net, xdp: add ability to specify BTF ID for XDP metadata
net, xdp: add ability to specify frame size threshold for XDP metadata
libbpf: factor out __bpf_set_link_xdp_fd_replace() args into a struct
libbpf: add ability to set the BTF/type ID on setting XDP prog
libbpf: add ability to set the meta threshold on setting XDP prog
libbpf: pass &bpf_link_create_opts directly to
bpf_program__attach_fd()
libbpf: add bpf_program__attach_xdp_opts()
selftests/bpf: expand xdp_link to check that setting meta opts works
samples/bpf: pass a struct to sample_install_xdp()
samples/bpf: add ability to specify metadata threshold
stddef: make __struct_group() UAPI C++-friendly
net, xdp: move XDP metadata helpers into new xdp_meta.h
net, xdp: allow metadata > 32
net, skbuff: add ability to skip skb metadata comparison
net, skbuff: constify the @skb argument of skb_hwtstamps()
net, xdp: add basic generic metadata accessors
bpf, btf: add a pair of function to work with the BTF ID + type ID
pair
net, xdp: add &sk_buff <-> &xdp_meta_generic converters
net, xdp: prefetch data a bit when building an skb from an &xdp_frame
net, xdp: try to fill skb fields when converting from an &xdp_frame
net, gro: decouple GRO from the NAPI layer
net, gro: expose some GRO API to use outside of NAPI
bpf, cpumap: switch to GRO from netif_receive_skb_list()
bpf, cpumap: add option to set a timeout for deferred flush
samples/bpf: add 'timeout' option to xdp_redirect_cpu
net, skbuff: introduce napi_skb_cache_get_bulk()
bpf, cpumap: switch to napi_skb_cache_get_bulk()
rcupdate: fix access helpers for incomplete struct pointers on GCC <
10
net, xdp: remove unused xdp_attachment_info::flags
net, xdp: make &xdp_attachment_info a bit more useful in drivers
net, xdp: add an RCU version of xdp_attachment_setup()
net, xdp: replace net_device::xdp_prog pointer with
&xdp_attachment_info
net, xdp: shortcut skb->dev in bpf_prog_run_generic_xdp()
net, xdp: build XDP generic metadata on Generic (skb) XDP path
net, ice: allow XDP prog hot-swapping
net, ice: consolidate all skb fields processing
net, ice: use an onstack &xdp_meta_generic_rx to store HW frame info
net, ice: build XDP generic metadata
libbpf: compress Endianness ops with a macro
selftests/bpf: fix using test_xdp_meta BPF prog via skeleton infra
selftests/bpf: add XDP Generic Hints selftest

Larysa Zaremba (5):
libbpf: factor out BTF loading from load_module_btfs()
libbpf: try to load vmlinux BTF from the kernel first
libbpf: patch module BTF ID into BPF insns
libbpf: add LE <--> CPU conversion helpers
libbpf: introduce a couple memory access helpers

Michal Swiatkowski (1):
bpf, xdp: declare generic XDP metadata structure

MAINTAINERS | 5 +-
drivers/net/ethernet/brocade/bna/bnad.c | 1 +
drivers/net/ethernet/cortina/gemini.c | 1 +
drivers/net/ethernet/intel/ice/ice.h | 16 +-
drivers/net/ethernet/intel/ice/ice_lib.c | 4 +-
drivers/net/ethernet/intel/ice/ice_main.c | 79 +-
drivers/net/ethernet/intel/ice/ice_ptp.c | 19 +-
drivers/net/ethernet/intel/ice/ice_ptp.h | 17 +-
drivers/net/ethernet/intel/ice/ice_txrx.c | 51 +-
drivers/net/ethernet/intel/ice/ice_txrx.h | 3 +-
drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 154 +--
drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 88 +-
drivers/net/ethernet/intel/ice/ice_xsk.c | 26 +-
.../ethernet/mellanox/mlx5/core/en/xsk/rx.c | 1 +
drivers/net/ethernet/netronome/nfp/nfd3/xsk.c | 1 +
drivers/net/tun.c | 2 +-
include/linux/bpf.h | 3 +-
include/linux/btf.h | 13 +
include/linux/filter.h | 2 +
include/linux/netdevice.h | 41 +-
include/linux/rcupdate.h | 37 +-
include/linux/skbuff.h | 35 +-
include/net/gro.h | 53 +-
include/net/xdp.h | 34 +-
include/net/xdp_meta.h | 398 ++++++++
include/uapi/linux/bpf.h | 194 ++++
include/uapi/linux/if_link.h | 2 +
include/uapi/linux/stddef.h | 12 +-
kernel/bpf/bpf_iter.c | 1 +
kernel/bpf/btf.c | 133 ++-
kernel/bpf/cgroup.c | 4 +-
kernel/bpf/cpumap.c | 80 +-
kernel/bpf/net_namespace.c | 1 +
kernel/bpf/syscall.c | 4 +-
net/bpf/Makefile | 5 +-
net/{core/xdp.c => bpf/core.c} | 214 +++-
net/bpf/dev.c | 871 +++++++++++++++++
net/bpf/prog_ops.c | 912 ++++++++++++++++++
net/bpf/test_run.c | 2 +-
net/core/Makefile | 2 +-
net/core/dev.c | 869 +----------------
net/core/dev.h | 4 -
net/core/filter.c | 883 +----------------
net/core/gro.c | 120 ++-
net/core/rtnetlink.c | 24 +-
net/core/skbuff.c | 44 +
net/packet/af_packet.c | 8 +-
net/xdp/xsk.c | 2 +-
samples/bpf/xdp_redirect_cpu_user.c | 44 +-
samples/bpf/xdp_redirect_map_multi_user.c | 26 +-
samples/bpf/xdp_redirect_map_user.c | 22 +-
samples/bpf/xdp_redirect_user.c | 21 +-
samples/bpf/xdp_router_ipv4_user.c | 20 +-
samples/bpf/xdp_sample_user.c | 38 +-
samples/bpf/xdp_sample_user.h | 11 +-
tools/include/uapi/linux/bpf.h | 194 ++++
tools/include/uapi/linux/if_link.h | 2 +
tools/include/uapi/linux/stddef.h | 50 +
tools/lib/bpf/bpf.c | 22 +
tools/lib/bpf/bpf.h | 22 +-
tools/lib/bpf/bpf_core_read.h | 3 +-
tools/lib/bpf/bpf_endian.h | 56 +-
tools/lib/bpf/bpf_helpers.h | 64 ++
tools/lib/bpf/btf.c | 142 ++-
tools/lib/bpf/libbpf.c | 201 +++-
tools/lib/bpf/libbpf.h | 30 +-
tools/lib/bpf/libbpf.map | 2 +
tools/lib/bpf/libbpf_internal.h | 7 +-
tools/lib/bpf/netlink.c | 81 +-
tools/lib/bpf/relo_core.c | 8 +-
tools/lib/bpf/relo_core.h | 1 +
tools/testing/selftests/bpf/.gitignore | 1 +
tools/testing/selftests/bpf/Makefile | 4 +-
.../selftests/bpf/prog_tests/xdp_link.c | 30 +-
.../selftests/bpf/progs/test_xdp_meta.c | 40 +-
tools/testing/selftests/bpf/test_xdp_meta.c | 294 ++++++
tools/testing/selftests/bpf/test_xdp_meta.sh | 59 +-
77 files changed, 4758 insertions(+), 2212 deletions(-)
create mode 100644 include/net/xdp_meta.h
rename net/{core/xdp.c => bpf/core.c} (73%)
create mode 100644 net/bpf/dev.c
create mode 100644 net/bpf/prog_ops.c
create mode 100644 tools/include/uapi/linux/stddef.h
create mode 100644 tools/testing/selftests/bpf/test_xdp_meta.c

--
2.36.1