RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun

From: Jon Rosen (jrosen)
Date: Mon May 21 2018 - 13:22:08 EST


On Monday, May 21, 2018 1:07 PM, Willem de Bruijn
<willemdebruijn.kernel@xxxxxxxxx> wrote:
>On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen) <jrosen@xxxxxxxxx> wrote:
>> On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn
>> <willemdebruijn.kernel@xxxxxxxxx> wrote:
>>> On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
>>> <willemdebruijn.kernel@xxxxxxxxx> wrote:
>>>> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jrosen@xxxxxxxxx> wrote:
>>>>> Fix a PACKET_RX_RING bug in versions TPACKET_V1 and TPACKET_V2 that
>>>>> causes the ring to get corrupted by allowing multiple kernel threads
>>>>> to claim ownership of the same ring entry. Track ownership in a shadow
>>>>> ring structure to prevent other kernel threads from reusing the same
>>>>> entry before it's fully filled in, passed to user space, and then
>>>>> eventually passed back to the kernel for use with a new packet.
>>>>>
>>>>> Signed-off-by: Jon Rosen <jrosen@xxxxxxxxx>
>>>>> ---
>>>>>
>>>>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>>>>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2. This bug makes
>>>>> it possible for multiple kernel threads to claim ownership of the same
>>>>> ring entry, corrupting the ring and the corresponding packet(s).
>>>>>
>>>>> These diffs are the second proposed solution; the previous proposal was
>>>>> described in https://www.mail-archive.com/netdev@xxxxxxxxxxxxxxx/msg227468.html
>>>>> under the subject [RFC PATCH] packet: mark ring entry as in-use inside
>>>>> spin_lock to prevent RX ring overrun.
>>>>>
>>>>> Those diffs would have changed the binary interface and broken certain
>>>>> applications. Consensus was that such a change would be inappropriate.
>>>>>
>>>>> These new diffs use a shadow ring in kernel space to track the intermediate
>>>>> state of an entry and prevent more than one kernel thread from simultaneously
>>>>> allocating the same ring entry. This avoids any impact to the binary interface
>>>>> between kernel and userspace but comes at the additional cost of requiring a
>>>>> second spin_lock when passing ownership of a ring entry to userspace.
>>>>>
>>>>> Jon Rosen (1):
>>>>> packet: track ring entry use using a shadow ring to prevent RX ring
>>>>> overrun
>>>>>
>>>>> net/packet/af_packet.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> net/packet/internal.h | 14 +++++++++++
>>>>> 2 files changed, 78 insertions(+)
>>>>>
>>>>
>>>>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>>>> #endif
>>>>>
>>>>>         if (po->tp_version <= TPACKET_V2) {
>>>>> +               spin_lock(&sk->sk_receive_queue.lock);
>>>>>                 __packet_set_status(po, h.raw, status);
>>>>> +               packet_rx_shadow_release(rx_shadow_ring_entry);
>>>>> +               spin_unlock(&sk->sk_receive_queue.lock);
>>>>> +
>>>>>                 sk->sk_data_ready(sk);
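
For readers following along, the scheme in these diffs looks roughly
like the sketch below. Names and details are illustrative, not the
exact patch code; the point is that the flag lives in kernel-only
memory and both the claim and the release happen under
sk_receive_queue.lock.

struct packet_rx_shadow_entry {
	bool inuse;		/* a kernel thread owns this slot */
};

/* Claim the head slot.  Called under sk_receive_queue.lock, after
 * the caller has already seen tp_status == TP_STATUS_KERNEL.
 */
static struct packet_rx_shadow_entry *
packet_rx_shadow_claim(struct packet_rx_shadow_entry *ring,
		       unsigned int head)
{
	struct packet_rx_shadow_entry *entry = &ring[head];

	if (entry->inuse)	/* still being filled by another thread */
		return NULL;
	entry->inuse = true;
	return entry;
}

/* Release the slot.  Must be paired with __packet_set_status()
 * under the same lock, or the interleaving traced later in this
 * mail reappears.
 */
static void packet_rx_shadow_release(struct packet_rx_shadow_entry *entry)
{
	entry->inuse = false;
}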
>>>>
>>>> Thanks for continuing to look at this. I spent some time on it last time
>>>> around but got stuck, too.
>>>>
>>>> This version takes an extra spinlock in the hot path. That will be very
>>>> expensive. Once we need to accept that, we could opt for a simpler
>>>> implementation akin to the one discussed in the previous thread:
>>>>
>>>> stash a value in tp_padding or similar while tp_status remains
>>>> TP_STATUS_KERNEL to signal ownership to concurrent kernel
>>>> threads. The issue previously was that that field could not atomically
>>>> be cleared together with __packet_set_status. This is no longer
>>>> an issue when holding the queue lock.
>>>>
>>>> With a field like tp_padding, unlike tp_len, it is arguably also safe to
>>>> clear it after flipping status (userspace should treat it as undefined).
>>>>
>>>> With v1 tpacket_hdr, no explicit padding field is defined but due to
>>>> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
>>>> platforms.
>>>>
>>>> The danger with using padding is that a process may write to it
>>>> and cause deadlock, of course. There is no logical reason for doing
>>>> so.
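
A minimal sketch of that in-band variant for TPACKET_V2 (the field
use is hypothetical and assumes userspace never writes tp_padding --
exactly the risk noted above; both helpers run under
sk_receive_queue.lock, for the reason shown in the sequence later in
this mail):

/* Claim the head frame for this kernel thread. */
static bool tpacket_v2_claim(struct tpacket2_hdr *h)
{
	if (h->tp_status != TP_STATUS_KERNEL || h->tp_padding[0])
		return false;	/* userspace or another thread owns it */
	h->tp_padding[0] = 1;	/* in-band in-use marker */
	return true;
}

/* Publish the frame to userspace, then clear the marker. */
static void tpacket_v2_publish(struct tpacket2_hdr *h)
{
	h->tp_status = TP_STATUS_USER;
	h->tp_padding[0] = 0;	/* userspace treats this as undefined */
}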
>>>
>>> For the ring, there is no requirement to allocate exactly the amount
>>> specified by the user request. Safer than relying on shared memory
>>> and simpler than the extra allocation in this patch would be to allocate
>>> extra shadow memory at the end of the ring (and not mmap that).
>>>
>>> That still leaves an extra cold cacheline vs using tp_padding.
>>
>> Given my lack of experience and knowledge in writing kernel code,
>> it was easier for me to allocate the shadow ring as a separate
>> structure. Of course it's not about me and my skills so if it's
>> more appropriate to allocate at the tail of the existing ring
>> then certainly I can look at doing that.
>>
>> I think the bigger issues, as you've pointed out, are the cost of
>> the additional spin lock and whether the additional state should be
>> stored in-band (fewer cache lines) or out-of-band (less risk of
>> breaking due to unpredictable application behavior).
>
> We don't need the spinlock if clearing the shadow byte after
> setting the status to user.
>
> Worst case, user will set it back to kernel while the shadow
> byte is not cleared yet and the next producer will drop a packet.
> But next producers will make progress, so there is no deadlock
> or corruption.

I thought so too for a while, but after spending more time on this
than I care to admit I realized the following sequence was occurring:

Core A                              Core B
------                              ------
- Enter spin_lock
- Get tp_status of head (X)
    tp_status == 0
- Check inuse
    inuse == 0
- Allocate entry X
    advance head (X+1)
    set inuse=1
- Exit spin_lock

<very long delay>
                                    <allocate N-1 entries
                                     where N = size of ring>

                                    - Enter spin_lock
                                    - Get tp_status of head (X+N)
                                        tp_status == 0 (but slot
                                        in use for X on core A)

- Write tp_status of  <--- trouble!
  X = TP_STATUS_USER  <--- trouble!
- Write inuse=0       <--- trouble!

                                    - Check inuse
                                        inuse == 0
                                    - Allocate entry X+N
                                        advance head (X+N+1)
                                        set inuse=1
                                    - Exit spin_lock


At this point Core A has just passed slot X to userspace with a
packet and Core B has just been assigned slot X+N (the same slot as
X) for its new packet, so both cores end up filling in the same
slot. Tracking this down was one of the reasons it took me a while
to produce these updated diffs.


>
> It probably does require a shadow structure as opposed to a
> padding byte to work with the long tail of (possibly broken)
> applications, sadly.

I agree.

>
> A setsockopt for userspace to signal a stricter interpretation of
> tp_status to elide the shadow hack could then be considered.
> It's not pretty. Either way, no full new version is required.
>
>> As much as I would like to find a solution that doesn't require
>> the spin lock I have yet to do so. Maybe the answer is that
>> existing applications will need to suffer the performance impact
>> but a new version or option for TPACKET_V1/V2 could be added to
>> indicate strict adherence to the TP_STATUS_USER bit and then the
>> original diffs could be used.
>>
>> There is another option I was considering but have yet to try
>> which would avoid needing a shadow ring by using counter(s) to
>> track maximum sequence number queued to userspace vs. the next
>> sequence number to be allocated in the ring. If the difference
>> is greater than the size of the ring then the ring can be
>> considered full and the allocation would fail. Of course this may
>> create an additional hotspot between cores; I'm not sure whether
>> that would be significant.
>
> Please do have a look, but I don't think that this will work in this
> case in practice. It requires tracking the producer tail. Updating
> the slowest writer requires probing each subsequent slot's status
> byte to find the new tail, which is a lot of (by then cold) cacheline
> reads.

I've thought about it a little more and am not convinced it's
workable, but I'll spend a little more time on it before giving
up.
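
For reference, the check I was imagining reduces to something like
this (untested sketch; head_seq counts slots handed out to kernel
threads, tail_seq is the oldest slot not yet passed to userspace,
and, as you point out, advancing tail_seq is the expensive part):

/* Ring is full when a whole lap separates head from tail.
 * Unsigned subtraction handles sequence wrap-around.
 */
static bool packet_rx_ring_full(unsigned long head_seq,
				unsigned long tail_seq,
				unsigned int ring_size)
{
	return head_seq - tail_seq >= ring_size;
}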