Re: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

From: Willem de Bruijn
Date: Sun May 20 2018 - 17:57:15 EST


On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jrosen@xxxxxxxxx> wrote:
> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
> casues the ring to get corrupted by allowing multiple kernel threads
> to claim ownership of the same ring entry. Track ownership in a shadow
> ring structure to prevent other kernel threads from reusing the same
> entry before it's fully filled in, passed to user space, and then
> eventually passed back to the kernel for use with a new packet.
>
> Signed-off-by: Jon Rosen <jrosen@xxxxxxxxx>
> ---
>
> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2. This bug makes
> it possible for multiple kernel threads to claim ownership of the same
> ring entry, corrupting the ring and the corresponding packet(s).
>
> These diffs are the second proposed solution, previous proposal was described
> in https://www.mail-archive.com/netdev@xxxxxxxxxxxxxxx/msg227468.html
> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
> to prevent RX ring overrun
>
> Those diffs would have changed the binary interface and have broken certain
> applications. Consensus was that such a change would be inappropriate.
>
> These new diffs use a shadow ring in kernel space for tracking intermediate
> state of an entry and prevent more than one kernel thread from simultaneously
> allocating a ring entry. This avoids any impact to the binary interface
> between kernel and userspace but comes at the additional cost of requiring a
> second spin_lock when passing ownership of a ring entry to userspace.
>
> Jon Rosen (1):
> packet: track ring entry use using a shadow ring to prevent RX ring
> overrun
>
> net/packet/af_packet.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
> net/packet/internal.h | 14 +++++++++++
> 2 files changed, 78 insertions(+)
>

> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> #endif
>
> if (po->tp_version <= TPACKET_V2) {
> + spin_lock(&sk->sk_receive_queue.lock);
> __packet_set_status(po, h.raw, status);
> + packet_rx_shadow_release(rx_shadow_ring_entry);
> + spin_unlock(&sk->sk_receive_queue.lock);
> +
> sk->sk_data_ready(sk);

Thanks for continuing to look at this. I spent some time on it last time
around but got stuck, too.

This version takes an extra spinlock in the hot path. That will be very
expensive. Once we need to accept that, we could opt for a simpler
implementation akin to the one discussed in the previous thread:

stash a value in tp_padding or similar while tp_status remains
TP_STATUS_KERNEL to signal ownership to concurrent kernel
threads. The issue previously was that that field could not atomically
be cleared together with __packet_set_status. This is no longer
an issue when holding the queue lock.

With a field like tp_padding, unlike tp_len, it is arguably also safe to
clear it after flipping status (userspace should treat it as undefined).

With v1 tpacket_hdr, no explicit padding field is defined but due to
TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
platforms.

The danger with using padding is that a process may write to it
and cause deadlock, of course. There is no logical reason for doing
so.