Re: [net-next PATCH v4 7/7] net: ravb: Allocate RX buffers via page pool

From: Paul Barker
Date: Thu May 30 2024 - 05:21:45 EST


On 29/05/2024 21:52, Sergey Shtylyov wrote:
> On 5/28/24 6:03 PM, Paul Barker wrote:
>
>> This patch makes multiple changes that can't be separated:
>>
>> 1) Allocate plain RX buffers via a page pool instead of allocating
>> SKBs, then use build_skb() when a packet is received.
>> 2) For GbEth IP, reduce the RX buffer size to 2kB.
>> 3) For GbEth IP, merge packets which span more than one RX descriptor
>> as SKB fragments instead of copying data.
>>
>> Implementing (1) without (2) would require the use of an order-1 page
>> pool (instead of an order-0 page pool split into page fragments) for
>> GbEth.
>>
>> Implementing (2) without (3) would leave us no space to re-assemble
>> packets which span more than one RX descriptor.
>>
>> Implementing (3) without (1) would not be possible as the network stack
>> expects to use put_page() or page_pool_put_page() to free SKB fragments
>> after an SKB is consumed.
>>
>> RX checksum offload support is adjusted to handle both linear and
>> nonlinear (fragmented) packets.
>>
>> This patch gives the following improvements during testing with iperf3.
>>
>> * RZ/G2L:
>> * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
>> * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)
>>
>> * RZ/G2UL:
>> * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
>> * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)
>>
>> * RZ/G3S:
>> * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
>> * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)
>>
>> * RZ/Five:
>> * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
>> * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)
>>
>> There is no significant impact on bandwidth or CPU load in testing on
>> RZ/G2H or R-Car M3N.
>>
>> Signed-off-by: Paul Barker <paul.barker.ct@xxxxxxxxxxxxxx>
>> ---
>> Changes v3->v4:
>> * Used a separate page pool for each RX queue.
>> * Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can
>> simplify the calling function.
>> * Explained the calculation of rx_desc->ds_cc.
>> * Added handling of nonlinear SKBs in ravb_rx_csum_gbeth().
>>
>> drivers/net/ethernet/renesas/ravb.h | 10 +-
>> drivers/net/ethernet/renesas/ravb_main.c | 230 ++++++++++++++---------
>> 2 files changed, 146 insertions(+), 94 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
>> index 6a7aa7dd17e6..f2091a17fcf7 100644
>> --- a/drivers/net/ethernet/renesas/ravb.h
>> +++ b/drivers/net/ethernet/renesas/ravb.h
> [...]
>> @@ -1094,7 +1099,8 @@ struct ravb_private {
>> struct ravb_tx_desc *tx_ring[NUM_TX_QUEUE];
>> void *tx_align[NUM_TX_QUEUE];
>> struct sk_buff *rx_1st_skb;
>> - struct sk_buff **rx_skb[NUM_RX_QUEUE];
>> + struct page_pool *rx_pool[NUM_RX_QUEUE];
>
> Don't we need #include <net/page_pool/types.h>?

Yes. I got away with it because ravb_main.c includes
<net/page_pool/helpers.h> before including "ravb.h", but the header
shouldn't rely on that include order.

>
> [...]
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>> index dd92f074881a..bb7f7d44be6e 100644
>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
> [...]
>> @@ -317,35 +289,56 @@ static void ravb_ring_free(struct net_device *ndev, int q)
>> priv->tx_skb[q] = NULL;
>> }
>>
>> +static int
>> +ravb_alloc_rx_buffer(struct net_device *ndev, int q, u32 entry, gfp_t gfp_mask,
>> + struct ravb_rx_desc *rx_desc)
>> +{
>> + struct ravb_private *priv = netdev_priv(ndev);
>> + const struct ravb_hw_info *info = priv->info;
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + dma_addr_t dma_addr;
>> + unsigned int size;
>> +
>> + size = info->rx_buffer_size;
>> + rx_buff->page = page_pool_alloc(priv->rx_pool[q], &rx_buff->offset, &size,
>> + gfp_mask);
>> + if (unlikely(!rx_buff->page)) {
>> + /* We just set the data size to 0 for a failed mapping
>> + * which should prevent DMA from happening...
>> + */
>> + rx_desc->ds_cc = cpu_to_le16(0);
>> + return -ENOMEM;
>> + }
>> +
>> + dma_addr = page_pool_get_dma_addr(rx_buff->page) + rx_buff->offset;
>> + dma_sync_single_for_device(ndev->dev.parent, dma_addr,
>> + info->rx_buffer_size, DMA_FROM_DEVICE);
>
> Do we really need this call?

Looking at the .config, I see CONFIG_DMA_NEED_SYNC=y, so yes, I think
this call is needed.

>
>> + rx_desc->dptr = cpu_to_le32(dma_addr);
>> +
>> + /* The end of the RX buffer is used to store skb shared data, so we need
>> + * to ensure that the hardware leaves enough space for this.
>> + */
>> + rx_desc->ds_cc = cpu_to_le16(info->rx_buffer_size
>> + - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
>
> Please leave the - operator on the previous line...

Ack.

>
>> + - ETH_FCS_LEN + sizeof(__sum16));
>
> Here as well...

Ack.
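
As a rough illustration of the arithmetic behind ds_cc, with the
operators kept at the end of each line, here is a user-space sketch.
RX_BUFFER_SIZE, SHINFO_SIZE and CSUM_LEN are stand-in assumptions and
rx_desc_data_size() is a hypothetical helper, not driver code. In the
kernel the reserved space is SKB_DATA_ALIGN(sizeof(struct
skb_shared_info)), which depends on the configuration.

```c
#include <stdint.h>

/* Assumed illustrative values; the real ones come from the kernel. */
#define RX_BUFFER_SIZE	2048u	/* 2kB GbEth RX buffer, per the commit message */
#define SHINFO_SIZE	320u	/* hypothetical aligned skb_shared_info size */
#define ETH_FCS_LEN	4u	/* Ethernet frame check sequence length */
#define CSUM_LEN	2u	/* stand-in for sizeof(__sum16) */

/* Mirrors the expression in the patch: buffer size, minus the space
 * reserved for shared info, minus the FCS length, plus the checksum
 * length.
 */
static inline uint16_t rx_desc_data_size(void)
{
	return (uint16_t)(RX_BUFFER_SIZE -
			  SHINFO_SIZE -
			  ETH_FCS_LEN + CSUM_LEN);
}
```

With these assumed values the hardware would be told it may write
2048 - 320 - 4 + 2 = 1726 bytes per descriptor.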

>
>> + return 0;
>> +}
>> +
>> static u32
>> ravb_rx_ring_refill(struct net_device *ndev, int q, u32 count, gfp_t gfp_mask)
>> {
>> struct ravb_private *priv = netdev_priv(ndev);
>> - const struct ravb_hw_info *info = priv->info;
>> struct ravb_rx_desc *rx_desc;
>> - dma_addr_t dma_addr;
>> u32 i, entry;
>>
>> for (i = 0; i < count; i++) {
>> entry = (priv->dirty_rx[q] + i) % priv->num_rx_ring[q];
>> rx_desc = ravb_rx_get_desc(priv, q, entry);
>> - rx_desc->ds_cc = cpu_to_le16(info->rx_max_desc_use);
>>
>> - if (!priv->rx_skb[q][entry]) {
>> - priv->rx_skb[q][entry] = ravb_alloc_skb(ndev, info, gfp_mask);
>> - if (!priv->rx_skb[q][entry])
>> + if (!priv->rx_buffers[q][entry].page) {
>> + if (unlikely(ravb_alloc_rx_buffer(ndev, q, entry,
>
> Well, IIRC Greg KH is against using unlikely() unless you have actually
> instrumented the code and this gives an improvement... have you? :-)

My understanding was that we should use unlikely() for error checking in
hot code paths where we want the "good" path to be optimised. I can drop
this if I'm wrong though.
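
For reference, this is the pattern I have in mind, as a user-space
sketch. The likely()/unlikely() macros match the kernel's definitions
on top of the GCC/Clang builtin. refill_slots() is a hypothetical
stand-in for a refill loop, with malloc() in place of the page pool
allocation. It shows only where the hint goes and says nothing about
whether it is measurable here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Branch hints, defined as in the kernel (GCC/Clang builtin). */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Hypothetical refill loop: fills empty slots and stops at the first
 * allocation failure. Returns the number of slots processed.
 */
static size_t refill_slots(void **slots, size_t count, size_t buf_size)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (slots[i])
			continue;	/* slot already has a buffer */

		slots[i] = malloc(buf_size);
		if (unlikely(!slots[i]))
			break;		/* error path, hinted as cold */
	}

	return i;
}
```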

>
> [...]
>> @@ -727,12 +739,22 @@ static void ravb_rx_csum_gbeth(struct sk_buff *skb)
>> if (unlikely(skb->len < sizeof(__sum16) * 2))
>> return;
>>
>> - hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
>> + if (skb_is_nonlinear(skb)) {
>> + last_frag = &shinfo->frags[shinfo->nr_frags - 1];
>> + hw_csum = skb_frag_address(last_frag) + skb_frag_size(last_frag) - sizeof(__sum16);
>> + } else {
>> + hw_csum = skb_tail_pointer(skb) - sizeof(__sum16);
>> + }
>
> We can do the subtraction only once here...

Ack. I'll pull that out of the if.
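
Roughly the following shape, shown as a user-space sketch rather than
the actual driver change. struct pkt and hw_csum_ptr() are hypothetical
stand-ins for the skb and the checksum lookup. The point is only that
both branches compute the end of the data and the subtraction happens
once afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an skb: either all data is in
 * the linear area, or the tail of the packet lives in a last fragment.
 */
struct pkt {
	uint8_t *linear_tail;	/* end of the linear data */
	uint8_t *last_frag;	/* start of last fragment, NULL if linear */
	size_t last_frag_size;	/* size of the last fragment in bytes */
};

/* Find the end of the packet data first, then step back once over the
 * trailing 16-bit hardware checksum.
 */
static uint8_t *hw_csum_ptr(const struct pkt *p)
{
	uint8_t *end;

	if (p->last_frag)
		end = p->last_frag + p->last_frag_size;
	else
		end = p->linear_tail;

	return end - sizeof(uint16_t);
}
```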

>
> [...]
>> @@ -816,14 +824,26 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> if (desc_status & MSC_CEEF)
>> stats->rx_missed_errors++;
>> } else {
>> + struct ravb_rx_buffer *rx_buff = &priv->rx_buffers[q][entry];
>> + void *rx_addr = page_address(rx_buff->page) + rx_buff->offset;
>
> Need an empty line here...

Ack.

>
>> die_dt = desc->die_dt & 0xF0;
>> - skb = ravb_get_skb_gbeth(ndev, entry, desc);
>> + dma_sync_single_for_cpu(ndev->dev.parent, le32_to_cpu(desc->dptr),
>> + desc_len, DMA_FROM_DEVICE);
>> +
>> switch (die_dt) {
>> case DT_FSINGLE:
>> case DT_FSTART:
>> /* Start of packet:
>> - * Set initial data length.
>> + * Prepare an SKB and add initial data.
>
> I'd prefer calling it skb in the comments...

Ack.

>
> [...]
>> @@ -865,7 +894,16 @@ static int ravb_rx_gbeth(struct net_device *ndev, int budget, int q)
>> stats->rx_bytes += skb->len;
>> napi_gro_receive(&priv->napi[q], skb);
>> rx_packets++;
>> +
>> + /* Clear rx_1st_skb so that it will only be
>> + * non-NULL when valid.
>> + */
>> + if (die_dt == DT_FEND)
>> + priv->rx_1st_skb = NULL;
>
> Hm, can't we do this under *case* DT_FEND above?

It makes more logical sense to me to do this as the last step, but I
guess it's slightly more efficient to do it earlier. I'll move it.
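
As a sketch of what moving it looks like, here is a simplified
user-space model. struct rx_state, enum desc_type and handle_desc() are
hypothetical and the enum values are not the hardware encodings. The
saved first buffer is handed over and cleared inside the end-of-packet
case itself, so it is only non-NULL while a multi-descriptor packet is
in flight.

```c
#include <stddef.h>

/* Hypothetical descriptor types; names follow the driver, values do not. */
enum desc_type { DT_FSINGLE, DT_FSTART, DT_FMID, DT_FEND };

/* Hypothetical per-queue state: first buffer of a packet in progress. */
struct rx_state {
	void *first;
};

/* Returns the completed packet, or NULL if more descriptors are needed. */
static void *handle_desc(struct rx_state *st, enum desc_type type, void *buf)
{
	void *done = NULL;

	switch (type) {
	case DT_FSINGLE:
		done = buf;		/* whole packet in one descriptor */
		break;
	case DT_FSTART:
		st->first = buf;	/* remember start of packet */
		break;
	case DT_FMID:
		break;			/* middle data, nothing to complete */
	case DT_FEND:
		done = st->first;	/* packet complete */
		st->first = NULL;	/* cleared here, not after delivery */
		break;
	}

	return done;
}
```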

Thanks,

--
Paul Barker
