RE: [PATCH net-next v8 3/3] net: axienet: Introduce dmaengine support

From: Pandey, Radhey Shyam
Date: Fri Oct 20 2023 - 15:35:44 EST


<snip>
> Minor nit: please use the reverse x-mas tree order

Sure, will fix it in next version.

<snip>
> You can avoid some duplicate code with:
> goto drop_skb;
>
> and adding at the bottom of this function:
>
> drop_skb:
> dev_kfree_skb_any(skb);
> return NETDEV_TX_OK;
>

Agree , will switch to it in next version.

<snip>

> You forgot to add the netif_txq_maybe_stop() call, as suggested by Jakub in
> the previous revision.

I was in an impression that these are multi queue specific APIs, so I skipped them.
But revisited the implementation and it seems clear now, and modified the driver
to use these lockless queue stopping / waking helpers.

+ tax = skb_get_tx_queue(lp->ndev, skb);
+ netdev_tx_sent_queue(txq, skb->len);
+ netif_txq_maybe_stop(txq, CIRC_SPACE(lp->tx_ring_head, lp->tx_ring_tail,
+ TX_BD_NUM_MAX), MAX_SKB_FRAGS + 1, 2 * MAX_SKB_FRAGS);


However, in netperf benchmark (TCP TX) I am seeing a dip in performance
(~35-40Mbps) when switching to these stop/wake helpers. Is it expected
considering extra logic in maintaining dynamic queue and these helpers?

Also, in throughput benchmarking there was no occurrence when the
queue was stopped/woken up.


Throughput: (10^6bits/sec)
915.55 (v8 - without using lockless queue stop/wake helpers)

======
Switch to lockless queue stop/wake helpers

# netperf -H 192.168.10.20 -t TCP_STREAM
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.20 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

131072 16384 16384 10.02 876.94


>
> > +
> > + skbuf_dma->skb = skb;
> > + skbuf_dma->sg_len = sg_len;
> > + dma_tx_desc->callback_param = lp;
> > + dma_tx_desc->callback_result = axienet_dma_tx_cb;
> > + dmaengine_submit(dma_tx_desc);
> > + dma_async_issue_pending(lp->tx_chan);
> > +
> > + return NETDEV_TX_OK;
> > +
> > +xmit_error_unmap_sg:
> > + dma_unmap_sg(lp->dev, skbuf_dma->sgl, sg_len, DMA_TO_DEVICE);
>
> If you need to drop the skb (as I suspect), you can reuse the drop_skb label
> here:
>
> drop_skb:
> dev_kfree_skb_any(skb);
> > + return NETDEV_TX_OK;

Yes , will make this change and this will also fix skb leak.

> > +}
> > +
> > /**
> > * axienet_tx_poll - Invoked once a transmit is completed by the
> > * Axi DMA Tx channel.
> > @@ -893,6 +1028,42 @@ axienet_start_xmit(struct sk_buff *skb, struct
> net_device *ndev)
> > return NETDEV_TX_OK;
> > }
> >
> > +/**
> > + * axienet_dma_rx_cb - DMA engine callback for RX channel.
> > + * @data: Pointer to the skbuf_dma_descriptor structure.
> > + * @result: error reporting through dmaengine_result.
> > + * This function is called by dmaengine driver for RX channel to
> > +notify
> > + * that the packet is received.
> > + */
> > +static void axienet_dma_rx_cb(void *data, const struct
> > +dmaengine_result *result) {
> > + struct axienet_local *lp = data;
> > + struct skbuf_dma_descriptor *skbuf_dma;
> > + size_t meta_len, meta_max_len, rx_len;
> > + struct sk_buff *skb;
> > + u32 *app_metadata;
>
> Minor nit: please respect the reverse x-mas tree order

Yes, will fix it in next version.

>
> > +
> > + skbuf_dma = axienet_get_rx_desc(lp, lp->rx_ring_tail++);
> > + skb = skbuf_dma->skb;
> > + app_metadata = dmaengine_desc_get_metadata_ptr(skbuf_dma-
> >desc, &meta_len,
> > + &meta_max_len);
> > + dma_unmap_single(lp->dev, skbuf_dma->dma_address, lp-
> >max_frm_size,
> > + DMA_FROM_DEVICE);
> > + /* TODO: Derive app word index programmatically */
> > + rx_len = (app_metadata[LEN_APP] & 0xFFFF);
> > + skb_put(skb, rx_len);
> > + skb->protocol = eth_type_trans(skb, lp->ndev);
> > + skb->ip_summed = CHECKSUM_NONE;
> > +
> > + __netif_rx(skb);
>
> It's a pity you can't leverage NAPI here.
>
> I think that could be doable as a follow-up, but I'm unsure if that would fit
> the DMA engine model: in this callback you could cache the ready dma index
> (a single range should suffice) and schedule the napi instance. The actual
> dma processing will be done in napi poll.
>
> Another possible follow-up could be introducing a "bulk" RX callback in the
> DMA engine, to mitigate the indirect call overhead on a burst of RX DMA
> completion - assuming the DMA engine actually generates such burst.

Agree , these are possible thoughts and will start working on it once
this baseline dmaengine support series is done.
>
> > + u64_stats_update_begin(&lp->rx_stat_sync);
> > + u64_stats_add(&lp->rx_packets, 1);
> > + u64_stats_add(&lp->rx_bytes, rx_len);
> > + u64_stats_update_end(&lp->rx_stat_sync);
> > + axienet_rx_submit_desc(lp->ndev);
> > + dma_async_issue_pending(lp->rx_chan);
> > +}
> > +
> > /**
> > * axienet_rx_poll - Triggered by RX ISR to complete the BD processing.
> > * @napi: Pointer to NAPI structure.
> > @@ -1126,6 +1297,150 @@ static irqreturn_t axienet_eth_irq(int irq,
> > void *_ndev)
> >
> > static void axienet_dma_err_handler(struct work_struct *work);
> >
> > +/**
> > + * axienet_rx_submit_desc - Submit the rx descriptors to dmaengine.
> > + * allocate skbuff, map the scatterlist and obtain a descriptor
> > + * and then add the callback information and submit descriptor.
> > + *
> > + * @ndev: net_device pointer
> > + *
> > + *Return: 0, on success.
> > + * non-zero error value on failure
> > + */
> > +static int axienet_rx_submit_desc(struct net_device *ndev) {
> > + struct dma_async_tx_descriptor *dma_rx_desc = NULL;
> > + struct axienet_local *lp = netdev_priv(ndev);
> > + struct skbuf_dma_descriptor *skbuf_dma;
> > + struct sk_buff *skb;
> > + dma_addr_t addr;
> > + int ret;
> > +
> > + skbuf_dma = axienet_get_rx_desc(lp, lp->rx_ring_head);
> > + if (!skbuf_dma)
> > + return -ENOMEM;
>
> Minor nit: here a newline would make the core more readable

Accepted , will add in next version.

>
> > + lp->rx_ring_head++;
> > + skb = netdev_alloc_skb(ndev, lp->max_frm_size);
> > + if (!skb)
> > + return -ENOMEM;
>
> Another possible follow-up: usually the skb header is allocated just before
> sending it to the network stack (e.g. just before the
> __netif_rx() call) to be cache friendly. Here you could allocate just the data
> part and later use e.g. build_skb_around()

Sure, will explore on it.
>
> > +
> > + sg_init_table(skbuf_dma->sgl, 1);
> > + addr = dma_map_single(lp->dev, skb->data, lp->max_frm_size,
> DMA_FROM_DEVICE);
> > + if (unlikely(dma_mapping_error(lp->dev, addr))) {
> > + if (net_ratelimit())
> > + netdev_err(ndev, "DMA mapping error\n");
> > + ret = -ENOMEM;
> > + goto rx_submit_err_free_skb;
> > + }
> > + sg_dma_address(skbuf_dma->sgl) = addr;
> > + sg_dma_len(skbuf_dma->sgl) = lp->max_frm_size;
> > + dma_rx_desc = dmaengine_prep_slave_sg(lp->rx_chan, skbuf_dma-
> >sgl,
> > + 1, DMA_DEV_TO_MEM,
> > + DMA_PREP_INTERRUPT);
> > + if (!dma_rx_desc) {
> > + ret = -EINVAL;
> > + goto rx_submit_err_unmap_skb;
> > + }
> > +
> > + skbuf_dma->skb = skb;
> > + skbuf_dma->dma_address = sg_dma_address(skbuf_dma->sgl);
> > + skbuf_dma->desc = dma_rx_desc;
> > + dma_rx_desc->callback_param = lp;
> > + dma_rx_desc->callback_result = axienet_dma_rx_cb;
> > + dmaengine_submit(dma_rx_desc);
> > +
> > + return 0;
> > +
> > +rx_submit_err_unmap_skb:
> > + dma_unmap_single(lp->dev, addr, lp->max_frm_size,
> DMA_FROM_DEVICE);
> > +rx_submit_err_free_skb:
> > + dev_kfree_skb(skb);
> > + return ret;
>
> It looks like the error code is ignored by the caller. Possibly you can change
> this to a 'void' function.

will make it void in next version.

Thanks,
Radhey