Re: [net-next v5 09/12] net: bnxt: Add SW GSO completion and teardown support
From: Joe Damato
Date: Thu Mar 26 2026 - 13:07:56 EST
On Thu, Mar 26, 2026 at 01:39:17PM +0100, Paolo Abeni wrote:
> On 3/23/26 7:38 PM, Joe Damato wrote:
> > Update __bnxt_tx_int and bnxt_free_one_tx_ring_skbs to handle SW GSO
> > segments:
> >
> > - MID segments: adjust tx_pkts/tx_bytes accounting and skip skb free
> > (the skb is shared across all segments and freed only once)
> >
> > - LAST segments: if the DMA IOVA path was used, use dma_iova_destroy to
> > tear down the contiguous mapping. On the fallback path, payload DMA
> > unmapping is handled by the existing per-BD dma_unmap_len walk.
> >
> > Both MID and LAST completions advance tx_inline_cons to release the
> > segment's inline header slot back to the ring.
> >
> > is_sw_gso is initialized to zero, so the new code paths are not run.
> >
> > Suggested-by: Jakub Kicinski <kuba@xxxxxxxxxx>
> > Reviewed-by: Pavan Chebbi <pavan.chebbi@xxxxxxxxxxxx>
> > Signed-off-by: Joe Damato <joe@xxxxxxx>
> > ---
> > v5:
> > - Added Pavan's Reviewed-by. No functional changes.
> >
> > v3:
> > - completion paths updated to use DMA IOVA APIs to teardown mappings.
> >
> > rfcv2:
> > - Update the shared header buffer consumer on TX completion.
> >
> > drivers/net/ethernet/broadcom/bnxt/bnxt.c | 82 +++++++++++++++++--
> > .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 19 ++++-
> > 2 files changed, 91 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index 2759a4e2b148..40a16f96feba 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -74,6 +74,8 @@
> > #include "bnxt_debugfs.h"
> > #include "bnxt_coredump.h"
> > #include "bnxt_hwmon.h"
> > +#include "bnxt_gso.h"
> > +#include <net/tso.h>
> >
> > #define BNXT_TX_TIMEOUT (5 * HZ)
> > #define BNXT_DEF_MSG_ENABLE (NETIF_MSG_DRV | NETIF_MSG_HW | \
> > @@ -817,12 +819,13 @@ static bool __bnxt_tx_int(struct bnxt *bp, struct bnxt_tx_ring_info *txr,
> > bool rc = false;
> >
> > while (RING_TX(bp, cons) != hw_cons) {
> > - struct bnxt_sw_tx_bd *tx_buf;
> > + struct bnxt_sw_tx_bd *tx_buf, *head_buf;
> > struct sk_buff *skb;
> > bool is_ts_pkt;
> > int j, last;
> >
> > tx_buf = &txr->tx_buf_ring[RING_TX(bp, cons)];
> > + head_buf = tx_buf;
> > skb = tx_buf->skb;
> >
> > if (unlikely(!skb)) {
> > @@ -869,6 +872,23 @@ static bool __bnxt_tx_int(struct bnxt *bp, struct bnxt_tx_ring_info *txr,
> > DMA_TO_DEVICE, 0);
> > }
> > }
> > +
> > + if (unlikely(head_buf->is_sw_gso)) {
> > + txr->tx_inline_cons++;
> > + if (head_buf->is_sw_gso == BNXT_SW_GSO_LAST) {
> > + if (dma_use_iova(&head_buf->iova_state))
>
> I'm likely lost, but AFAICS the previous patch / bnxt_sw_udp_gso_xmit()
> initializes head_buf->iova_state only when
> `dma_use_iova(&head_buf->iova_state) == true`. I.e., in the fallback
> scenario the previous iova_state is retained.
Note that dma_iova_try_alloc() zeroes the state before returning whether the
IOVA DMA API can be used or not, and I call it unconditionally (see below).
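To make that concrete, here's a minimal sketch (not the series' actual
code; my_try_iova() is a hypothetical helper) of the property I'm relying
on:

#include <linux/dma-mapping.h>

/* dma_iova_try_alloc() memsets *state to zero before it decides whether
 * the IOVA path is usable, so even on failure the state no longer
 * describes any previous mapping and dma_use_iova(state) reads false.
 */
static bool my_try_iova(struct device *dev, struct dma_iova_state *state,
			phys_addr_t phys, size_t len)
{
	if (dma_iova_try_alloc(dev, state, phys, len))
		return true;	/* IOVA path: state describes the range */

	/* Fallback path: *state was zeroed by the call above. */
	return false;
}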
> Additionally AFAICS dma_iova_destroy does not clear `head_buf->iova_state`.
That's my understanding too: dma_iova_destroy() doesn't clear the state.
> It looks like, if 2 consecutive skbs hitting the same slot use
> different dma mapping strategies (fallback vs iova), bad things will
> happen?!? Should the previous patch always initialize
> head_buf->iova_state?
AFAICT, switching the IOMMU domain would require unbinding the device,
changing the IOMMU type, and re-binding the device... which would destroy all
the rings in the process, so this wouldn't happen.
The only way I could imagine this happening is under extreme IOVA pressure
(maybe?):
- packet A lands in slot N, dma_iova_try_alloc succeeds ->
  head_buf->iova_state is copied
- completion of the packet occurs, dma_iova_destroy is called, and
  head_buf->iova_state is not cleared
- packet B lands in slot N, dma_iova_try_alloc fails due to IOVA pressure...
  head_buf->iova_state is stale
I'm pretty skeptical that this is a realistic case, TBH.
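Spelled out against the completion path from this patch, the misread
would look roughly like this (sketch reusing this series' field names;
&bp->pdev->dev stands in for the device pointer, which is elided in the
quote above):

	/* If packet B's xmit took the fallback path without rewriting
	 * head_buf->iova_state, this reads packet A's leftover state
	 * and tears down a mapping that no longer exists.
	 */
	if (head_buf->is_sw_gso == BNXT_SW_GSO_LAST) {
		if (dma_use_iova(&head_buf->iova_state))
			dma_iova_destroy(&bp->pdev->dev,
					 &head_buf->iova_state,
					 head_buf->iova_total_len,
					 DMA_TO_DEVICE, 0);
	}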
That said, and since it seems my v5 got CR'd, I can send a v6 with this
slight change to address the case you mention above.
I'll send it in a couple of hours unless I hear otherwise:
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_gso.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_gso.c
index 9c30ee063ef5..7c198847a771 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_gso.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_gso.c
@@ -142,8 +142,12 @@ netdev_tx_t bnxt_sw_udp_gso_xmit(struct bnxt *bp,
tx_buf->is_sw_gso = last ? BNXT_SW_GSO_LAST : BNXT_SW_GSO_MID;
- /* Store IOVA state on the last segment for completion */
- if (last && tso_dma_map_use_iova(&map)) {
+ /* Store IOVA state on the last segment for completion.
+ * Always copy so that a stale iova_state from a prior
+ * occupant of this ring slot cannot be misread by
+ * dma_use_iova() in the completion path.
+ */
+ if (last) {
tx_buf->iova_state = map.iova_state;
tx_buf->iova_total_len = map.total_len;
}
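For completeness: with the unconditional copy, a fallback-path LAST
segment stores the iova_state that dma_iova_try_alloc() has just zeroed,
so dma_use_iova() in the completion path reads false and the existing
per-BD dma_unmap_len walk does the unmapping, exactly as before. The cost
is one extra small struct copy per GSO skb.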