Re: [PATCH net-next v3 03/12] iavf: optimize Rx buffer allocation a bunch

From: Alexander H Duyck
Date: Wed May 31 2023 - 11:38:04 EST


On Tue, 2023-05-30 at 17:00 +0200, Alexander Lobakin wrote:
> The Rx hotpath code of IAVF is not well-optimized TBH. Before doing any
> further buffer model changes, shake it up a bit. Notably:
>
> 1. Cache more variables on the stack.
> DMA device, Rx page size, NTC -- these are the most common things
> used all throughout the hotpath, often in loops on each iteration.
> Instead of fetching (or even calculating, as with the page size) them
> from the ring all the time, cache them on the stack at the beginning
> of the NAPI polling callback. NTC will be written back at the end,
> the rest are used read-only, so no sync needed.
> 2. Don't move the recycled buffers around the ring.
> The idea of passing the page of the right-now-recycled-buffer to a
> different buffer, in this case, the first one that needs to be
> allocated, moreover, on each new frame, is fundamentally wrong. It
> involves a few o' fetches, branches and then writes (and one Rx
> buffer struct is at least 32 bytes) where they're completely unneeded,
> but gives no good -- the result is the same as if we'd recycle it
> inplace, at the same position where it was used. So drop this and let
> the main refilling function take care of all the buffers, which were
> processed and now need to be recycled/refilled.
> 3. Don't allocate with %GPF_ATOMIC on ifup.
> This involved introducing the @gfp parameter to a couple functions.
> Doesn't change anything for Rx -> softirq.
> 4. 1 budget unit == 1 descriptor, not skb.
> There could be underflow when receiving a lot of fragmented frames.
> If each of them would consist of 2 frags, it means that we'd process
> 64 descriptors at the point where we pass the 32th skb to the stack.
> But the driver would count that only as a half, which could make NAPI
> re-enable interrupts prematurely and create unnecessary CPU load.
> 5. Shortcut !size case.
> It's super rare, but possible -- for example, if the last buffer of
> the fragmented frame contained only FCS, which was then stripped by
> the HW. Instead of checking for size several times when processing,
> quickly reuse the buffer and jump to the skb fields part.
> 6. Refill the ring after finishing the polling loop.
> Previously, the loop wasn't starting a new iteration after the 64th
> desc, meaning that we were always leaving 16 buffers non-refilled
> until the next NAPI poll. It's better to refill them while they're
> still hot, so do that right after exiting the loop as well.
> For a full cycle of 64 descs, there will be 4 refills of 16 descs
> from now on.
>
> Function: add/remove: 4/2 grow/shrink: 0/5 up/down: 473/-647 (-174)
>
> + up to 2% performance.
>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>

What is the workload that is showing the performance improvement?

<...>

> @@ -1350,14 +1297,6 @@ static bool iavf_is_non_eop(struct iavf_ring *rx_ring,
> union iavf_rx_desc *rx_desc,
> struct sk_buff *skb)

I am pretty sure the skb pointer here is an unused variable. We needed
it for ixgbe to support RSC. I don't think you have any code that uses
it in this function and I know we removed the variable for i40e, see
commit d06e2f05b4f18 ("i40e: adjust i40e_is_non_eop").