Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces

From: Vitaly V. Bursov
Date: Fri Jun 07 2013 - 10:17:41 EST

07.06.2013 16:05, Daniel Borkmann wrote:
On 06/07/2013 02:41 PM, Mike Galbraith wrote:
(CC's net-fu dojo)

On Fri, 2013-06-07 at 14:56 +0300, Vitaly V. Bursov wrote:

I have a Linux router with a lot of interfaces (hundreds or
thousands of VLANs) and an application that creates an AF_PACKET
socket per interface and bind()s each socket to its interface.

Each socket has a BPF filter attached too.

The problem is observed on linux-3.8.13, but as far as I can see
from the source the latest version behaves similarly.

I noticed that the box has strange performance problems, with
most of the CPU time spent in __netif_receive_skb:
86.15% [k] __netif_receive_skb
1.41% [k] _raw_spin_lock
1.09% [k] fib_table_lookup
0.99% [k] local_bh_enable_ip

and this is the assembly with the "hot spot":
      │      shr    $0x8,%r15w
      │      and    $0xf,%r15d
 0.00 │      shl    $0x4,%r15
      │      add    $0xffffffff8165ec80,%r15
      │      mov    (%r15),%rax
 0.09 │      mov    %rax,0x28(%rsp)
      │      mov    0x28(%rsp),%rbp
 0.01 │      sub    $0x28,%rbp
      │      jmp    5c7
 1.72 │5b0:  mov    0x28(%rbp),%rax
 0.05 │      mov    0x18(%rsp),%rbx
 0.00 │      mov    %rax,0x28(%rsp)
 0.03 │      mov    0x28(%rsp),%rbp
 5.67 │      sub    $0x28,%rbp
 1.71 │5c7:  lea    0x28(%rbp),%rax
 1.73 │      cmp    %r15,%rax
      │      je     640
 1.74 │      cmp    %r14w,0x0(%rbp)
      │      jne    5b0
81.36 │      mov    0x8(%rbp),%rax
 2.74 │      cmp    %rax,%r8
      │      je     5eb
 1.37 │      cmp    0x20(%rbx),%rax
      │      je     5eb
 1.39 │      cmp    %r13,%rax
      │      jne    5b0
 0.04 │5eb:  test   %r12,%r12
 0.04 │      je     6f4
      │      mov    0xc0(%rbx),%eax
      │      mov    0xc8(%rbx),%rdx
      │      testb  $0x8,0x1(%rdx,%rax,1)
      │      jne    6d5

This corresponds to:

type = skb->protocol;
list_for_each_entry_rcu(ptype,
                &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
                if (pt_prev)
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = ptype;

This works perfectly well until there are a lot of AF_PACKET sockets, since
each socket adds a protocol handler to the ptype list:

# cat /proc/net/ptype
Type Device      Function
0800 eth2.1989   packet_rcv+0x0/0x400
0800 eth2.1987   packet_rcv+0x0/0x400
0800 eth2.1986   packet_rcv+0x0/0x400
0800 eth2.1990   packet_rcv+0x0/0x400
0800 eth2.1995   packet_rcv+0x0/0x400
0800 eth2.1997   packet_rcv+0x0/0x400
0800 eth2.1004   packet_rcv+0x0/0x400
0800             ip_rcv+0x0/0x310
0011             llc_rcv+0x0/0x3a0
0004             llc_rcv+0x0/0x3a0
0806             arp_rcv+0x0/0x150
And this obviously results in a huge performance penalty.
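
To make the scaling concrete, here is a minimal user-space sketch of the walk quoted above. It is not kernel code: struct model_ptype, model_build() and model_deliver() are hypothetical names made up for illustration, the bucket is a plain singly linked list, and the orig_dev comparison is left out. It only shows that every received packet visits every entry in its hash bucket, even though at most a couple of entries match the receiving device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct packet_type. */
struct model_ptype {
    uint16_t type;              /* protocol, e.g. 0x0800 for IPv4 */
    int dev;                    /* bound ifindex, 0 means "any device" */
    struct model_ptype *next;   /* next handler in the same hash bucket */
};

/* Fill storage[0..n] with one wildcard handler (like ip_rcv) followed by
 * one per-device handler for ifindex 1..n (like packet_rcv per VLAN).
 * storage must have room for n + 1 entries. */
static struct model_ptype *model_build(struct model_ptype *storage, size_t n)
{
    assert(storage != NULL);
    storage[0] = (struct model_ptype){ .type = 0x0800, .dev = 0, .next = NULL };
    for (size_t i = 1; i <= n; i++) {
        storage[i] = (struct model_ptype){ .type = 0x0800, .dev = (int)i,
                                           .next = NULL };
        storage[i - 1].next = &storage[i];
    }
    return &storage[0];
}

/* Walk the bucket for one packet received on skb_dev. Returns the number
 * of handlers that match and stores the number of entries visited. */
static size_t model_deliver(const struct model_ptype *bucket, uint16_t type,
                            int skb_dev, size_t *visited)
{
    size_t delivered = 0;

    *visited = 0;
    for (const struct model_ptype *p = bucket; p != NULL; p = p->next) {
        (*visited)++;
        if (p->type == type && (p->dev == 0 || p->dev == skb_dev))
            delivered++;
    }
    return delivered;
}
```

With 1000 per-device handlers in the bucket, a packet arriving on ifindex 7 is delivered to two handlers (the wildcard one and its own socket) but the loop still touches all 1001 entries, which is the linear cost that shows up in the profile.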

ptype_all, by the looks of it, should behave the same way.

Probably one way to fix this is to perform interface matching in the
af_packet handler, but there could be other cases, other protocols.

Ideas are welcome :)

Probably, that depends on _your scenario_ and/or BPF filter, but would it be
an alternative if you have only a few packet sockets (maybe one pinned to each
cpu) and cluster/load-balance them together via packet fanout? (Where you
bind the socket to ifindex 0, so that you get traffic from all devs...) That
would at least avoid that "hot spot", and you could post-process the interface
via sockaddr_ll. But I'd agree that this will not solve the actual problem you've
observed. ;-)
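
A minimal sketch of what such a setup might look like, assuming Linux headers that provide PACKET_FANOUT (3.1 or later) and a process with CAP_NET_RAW. The helper names fanout_arg(), any_dev_addr() and open_fanout_member() are hypothetical, and PACKET_FANOUT_CPU is just one possible mode chosen to match the "one socket pinned to each cpu" idea; error handling is kept to the bare minimum.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif

/* PACKET_FANOUT takes one int: group id in the low 16 bits,
 * fanout mode in the high 16 bits. */
static int fanout_arg(uint16_t group_id, uint16_t mode)
{
    return (int)((uint32_t)group_id | ((uint32_t)mode << 16));
}

/* Link-layer address that binds a packet socket to all devices
 * (sll_ifindex == 0) for a single protocol. */
static struct sockaddr_ll any_dev_addr(uint16_t ethertype)
{
    struct sockaddr_ll sll;

    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_protocol = htons(ethertype);
    sll.sll_ifindex = 0;
    return sll;
}

/* Open one member socket of a fanout group. Call this once per worker
 * with the same group_id. Returns the fd, or -1 on any failure
 * (for example when the caller lacks CAP_NET_RAW). */
static int open_fanout_member(uint16_t group_id)
{
    struct sockaddr_ll sll = any_dev_addr(ETH_P_IP);
    int arg = fanout_arg(group_id, PACKET_FANOUT_CPU);
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));

    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0 ||
        setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With this layout the number of ptype entries equals the number of worker sockets rather than the number of interfaces, at the price of receiving traffic from every device and filtering it afterwards.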

I wasn't aware of the ifindex 0 thing; it can help, thanks! Of course, if it
works for me (the application is a custom DHCP server), it will surely
increase the BPF overhead (I don't need to tap the traffic from all
interfaces). There are VLANs, bridges and bonds, so the server will likely
receive the same packets multiple times, and replies must be sent too...
But it should still be faster.
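
For the duplicate-delivery concern, one possible approach is to read the receiving interface from the sockaddr_ll that recvfrom() fills in and drop frames from devices the server does not serve. The sketch below assumes a packet socket bound to ifindex 0; recv_with_ifindex() and ifindex_is_served() are hypothetical helpers, and the "served" list stands in for whatever per-interface configuration the real DHCP server keeps.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Receive one frame from a packet socket and report the ifindex it was
 * received on. Returns the frame length, or -1 on error, in which case
 * *ifindex is left untouched. */
static ssize_t recv_with_ifindex(int fd, void *buf, size_t len, int *ifindex)
{
    struct sockaddr_ll from;
    socklen_t fromlen = sizeof(from);
    ssize_t n;

    assert(buf != NULL && ifindex != NULL);
    memset(&from, 0, sizeof(from));
    n = recvfrom(fd, buf, len, 0, (struct sockaddr *)&from, &fromlen);
    if (n < 0)
        return -1;
    *ifindex = from.sll_ifindex;
    return n;
}

/* Hypothetical application policy: accept a frame only if it arrived on
 * one of the interfaces the server is configured for. A linear scan is
 * used for brevity; a hash table would suit thousands of VLANs better. */
static int ifindex_is_served(const int *served, size_t count, int ifindex)
{
    for (size_t i = 0; i < count; i++) {
        if (served[i] == ifindex)
            return 1;
    }
    return 0;
}
```

The same ifindex can then be reused as sll_ifindex in the sockaddr_ll passed to sendto(), so the reply leaves through the interface the request came in on.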

I just checked isc-dhcpd-V3.1.3 running on multiple interfaces
(another system with 2.6.32):
$ cat /proc/net/ptype
Type Device      Function
ALL  eth0        packet_rcv_spkt+0x0/0x190
ALL  eth0.10     packet_rcv_spkt+0x0/0x190
ALL  eth0.11     packet_rcv_spkt+0x0/0x190

As I understand it, it'll hit this code:
list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
                if (pt_prev)
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = ptype;
which scales the same.

