[v3 net-next 00/10] skbuff: introduce skbuff_heads bulking and reusing

From: Alexander Lobakin
Date: Tue Feb 09 2021 - 19:16:23 EST


Currently, every skb allocation path allocates skbuff_heads one by
one via kmem_cache_alloc().
On the other hand, we have the percpu napi_alloc_cache, which stores
skbuff_heads queued up for freeing and flushes them in bulk.

We can use this cache not only for bulk-freeing, but also to obtain
heads for new skbs and avoid unconditional allocations, as well as
for bulk-allocating.
As accessing napi_alloc_cache requires NAPI softirq context, decaching
is guarded by an in_serving_softirq() check, with the option to
bypass the check when the context is known for certain.
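
The get/put flow described above can be modelled in userspace roughly
as follows. This is only a sketch, not the code from the series:
malloc()/free() stand in for kmem_cache_alloc_bulk() and
kmem_cache_free_bulk(), the in_softirq argument stands in for
in_serving_softirq(), and the CACHE_* sizes, struct names and helper
names are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed sizes, for illustration only; the series may use others. */
#define CACHE_SIZE	64
#define CACHE_BULK	16
#define CACHE_HALF	(CACHE_SIZE / 2)

/* Hypothetical stand-in for struct sk_buff. */
struct head {
	char pad[64];
};

/* Hypothetical stand-in for the percpu napi_alloc_cache. */
struct head_cache {
	unsigned int count;
	void *heads[CACHE_SIZE];
};

/* Userspace stand-in for kmem_cache_alloc_bulk(). */
static unsigned int slab_alloc_bulk(unsigned int n, void **out)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		out[i] = malloc(sizeof(struct head));
		if (!out[i])
			break;
	}

	return i;
}

/* Userspace stand-in for kmem_cache_free_bulk(). */
static void slab_free_bulk(unsigned int n, void **objs)
{
	while (n--)
		free(objs[n]);
}

/*
 * Obtain a head. The cache is used only when the caller is in (or
 * vouches for, via @force) the right context; otherwise fall back to
 * a plain one-by-one allocation.
 */
static void *head_get(struct head_cache *nc, bool in_softirq, bool force)
{
	if (!force && !in_softirq)
		return malloc(sizeof(struct head));

	/* Cache is empty: refill it with one bulk allocation. */
	if (!nc->count)
		nc->count = slab_alloc_bulk(CACHE_BULK, nc->heads);
	if (!nc->count)
		return NULL;

	return nc->heads[--nc->count];
}

/* Queue a head for reuse; flush the upper half in bulk once full. */
static void head_put(struct head_cache *nc, void *h)
{
	nc->heads[nc->count++] = h;

	if (nc->count == CACHE_SIZE) {
		slab_free_bulk(CACHE_HALF, nc->heads + CACHE_HALF);
		nc->count = CACHE_HALF;
	}
}
```

In this model a head queued by head_put() is the first one returned
by the next cached head_get(), and only half of the cache is flushed
when it fills up, so heads remain available for reuse after a flush.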

iperf3 showed gains of 35-70 Mbps for both TCP and UDP while
performing VLAN NAT on a 1.2 GHz MIPS board. The gain is likely to be
much bigger on more powerful hosts and NICs handling tens of Mpps.

Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs:
- kmalloc()/kmem_cache_alloc() itself by default allows allocating
memory from remote nodes to defragment their slabs. This is
controlled by a sysctl, but it means that an skbuff_head from a
remote node is an acceptable case;
- The easiest way to check whether the slab of an skbuff_head is
remote or pfmemalloc'ed is:

	if (!dev_page_is_reusable(virt_to_head_page(skb)))
		/* drop it */;

...*but*, given that most slabs are built of compound pages,
virt_to_head_page() will hit its unlikely branch on every single
call. This check cost at least 20 Mbps in test scenarios, so it
seems better _not_ to do it.
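
For reference, the condition that dev_page_is_reusable() tests (the
page sits on the local memory node and was not taken from pfmemalloc
reserves) can be modelled in userspace like this. It is a sketch under
assumptions: struct fake_page and fake_page_is_reusable() are
hypothetical stand-ins, and the local node is passed explicitly
instead of being read via numa_mem_id().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the struct page state the check reads. */
struct fake_page {
	int nid;		/* NUMA node the page belongs to */
	bool pfmemalloc;	/* taken from emergency reserves */
};

/* Reusable only if the page is local and not pfmemalloc'ed. */
static bool fake_page_is_reusable(const struct fake_page *page,
				  int local_nid)
{
	return page->nid == local_nid && !page->pfmemalloc;
}
```

A head backed by a page for which this returns false would be dropped
rather than cached. The cost measured above comes from
virt_to_head_page(), which has to run before the comparison, not from
the comparison itself.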

Since v2 [1]:
- also cover {,__}alloc_skb() and {,__}build_skb() cases (this became
handy after the changes that pass tiny skb requests to the kmalloc
layer);
- cover the cache with KASAN instrumentation (suggested by Eric
Dumazet, with help from Dmitry Vyukov);
- completely drop redundant __kfree_skb_flush() (also Eric);
- lots of code cleanups;
- expand the commit message with NUMA and pfmemalloc points (Jakub).

Since v1 [0]:
- use one unified cache instead of two separate ones to greatly
simplify the logic and reduce hotpath overhead (Edward Cree);
- new: also recycle GRO_MERGED_FREE skbs instead of freeing them
immediately;
- correct the performance numbers after optimizations and lots of
tests for different use cases.

[0] https://lore.kernel.org/netdev/20210111182655.12159-1-alobakin@xxxxx
[1] https://lore.kernel.org/netdev/20210113133523.39205-1-alobakin@xxxxx

Alexander Lobakin (10):
skbuff: move __alloc_skb() next to the other skb allocation functions
skbuff: simplify kmalloc_reserve()
skbuff: make __build_skb_around() return void
skbuff: simplify __alloc_skb() a bit
skbuff: use __build_skb_around() in __alloc_skb()
skbuff: remove __kfree_skb_flush()
skbuff: move NAPI cache declarations upper in the file
skbuff: reuse NAPI skb cache on allocation path (__build_skb())
skbuff: reuse NAPI skb cache on allocation path (__alloc_skb())
skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing

include/linux/skbuff.h | 4 +-
net/core/dev.c | 15 +-
net/core/skbuff.c | 392 ++++++++++++++++++++-------------------
net/netlink/af_netlink.c | 2 +-
4 files changed, 202 insertions(+), 211 deletions(-)

--
2.30.0