Re: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
From: Willem de Bruijn
Date: Sun May 20 2018 - 18:27:53 EST
On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:
> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jrosen@xxxxxxxxx> wrote:
>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>> causes the ring to get corrupted by allowing multiple kernel threads
>> to claim ownership of the same ring entry. Track ownership in a shadow
>> ring structure to prevent other kernel threads from reusing the same
>> entry before it's fully filled in, passed to user space, and then
>> eventually passed back to the kernel for use with a new packet.
>>
>> Signed-off-by: Jon Rosen <jrosen@xxxxxxxxx>
>> ---
>>
>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2. This bug makes
>> it possible for multiple kernel threads to claim ownership of the same
>> ring entry, corrupting the ring and the corresponding packet(s).
>>
>> These diffs are the second proposed solution; the previous proposal was
>> described in https://www.mail-archive.com/netdev@xxxxxxxxxxxxxxx/msg227468.html
>> subject [RFC PATCH] packet: mark ring entry as in-use inside spin_lock
>> to prevent RX ring overrun
>>
>> Those diffs would have changed the binary interface and have broken certain
>> applications. Consensus was that such a change would be inappropriate.
>>
>> These new diffs use a shadow ring in kernel space to track the intermediate
>> state of each entry and to prevent more than one kernel thread from
>> simultaneously allocating the same ring entry. This avoids any impact to the
>> binary interface between kernel and userspace, but comes at the additional
>> cost of a second spin_lock when passing ownership of a ring entry to userspace.
>>
>> Jon Rosen (1):
>> packet: track ring entry use using a shadow ring to prevent RX ring
>> overrun
>>
>> net/packet/af_packet.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> net/packet/internal.h | 14 +++++++++++
>> 2 files changed, 78 insertions(+)
>>
>
>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>> #endif
>>
>> if (po->tp_version <= TPACKET_V2) {
>> + spin_lock(&sk->sk_receive_queue.lock);
>> __packet_set_status(po, h.raw, status);
>> + packet_rx_shadow_release(rx_shadow_ring_entry);
>> + spin_unlock(&sk->sk_receive_queue.lock);
>> +
>> sk->sk_data_ready(sk);
>
> Thanks for continuing to look at this. I spent some time on it last time
> around but got stuck, too.
>
> This version takes an extra spinlock in the hot path. That will be very
> expensive. Once we need to accept that, we could opt for a simpler
> implementation akin to the one discussed in the previous thread:
>
> stash a value in tp_padding or similar while tp_status remains
> TP_STATUS_KERNEL to signal ownership to concurrent kernel
> threads. The issue previously was that that field could not be
> cleared atomically together with __packet_set_status. This is no
> longer an issue when holding the queue lock.
>
> With a field like tp_padding, unlike tp_len, it is arguably also safe to
> clear it after flipping status (userspace should treat it as undefined).
>
> With v1 tpacket_hdr, no explicit padding field is defined but due to
> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
> platforms.
>
> The danger with using padding is, of course, that a process may write
> to it and cause a deadlock. There is no legitimate reason for a
> process to do so.
For the ring, there is no requirement to allocate exactly the amount
specified by the user request. Safer than relying on shared memory
and simpler than the extra allocation in this patch would be to allocate
extra shadow memory at the end of the ring (and not mmap that).
That still leaves an extra cold cacheline versus using tp_padding.