RE: [PATCH net] net: stmmac: Stop using a single dma_map() for multiple descriptors

From: Suraj Jaiswal
Date: Tue Sep 10 2024 - 08:47:44 EST




-----Original Message-----
From: Andrew Halaney <ahalaney@xxxxxxxxxx>
Sent: Wednesday, September 4, 2024 3:47 AM
To: Suraj Jaiswal (QUIC) <quic_jsuraj@xxxxxxxxxxx>
Cc: Vinod Koul <vkoul@xxxxxxxxxx>; bhupesh.sharma@xxxxxxxxxx; Andy Gross <agross@xxxxxxxxxx>; Bjorn Andersson <andersson@xxxxxxxxxx>; Konrad Dybcio <konrad.dybcio@xxxxxxxxxx>; David S. Miller <davem@xxxxxxxxxxxxx>; Eric Dumazet <edumazet@xxxxxxxxxx>; Jakub Kicinski <kuba@xxxxxxxxxx>; Rob Herring <robh+dt@xxxxxxxxxx>; Krzysztof Kozlowski <krzysztof.kozlowski+dt@xxxxxxxxxx>; Conor Dooley <conor+dt@xxxxxxxxxx>; Alexandre Torgue <alexandre.torgue@xxxxxxxxxxx>; Jose Abreu <joabreu@xxxxxxxxxxxx>; Maxime Coquelin <mcoquelin.stm32@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; linux-arm-msm@xxxxxxxxxxxxxxx; devicetree@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-stm32@xxxxxxxxxxxxxxxxxxxxxxxxxxxx; Prasad Sodagudi <psodagud@xxxxxxxxxxx>; Rob Herring <robh@xxxxxxxxxx>; kernel <kernel@xxxxxxxxxxx>
Subject: Re: [PATCH net] net: stmmac: Stop using a single dma_map() for multiple descriptors

WARNING: This email originated from outside of Qualcomm. Please be wary of any links or attachments, and do not enable macros.

On Mon, Sep 02, 2024 at 03:24:36PM GMT, Suraj Jaiswal wrote:
> Currently same page address is shared
> between multiple buffer addresses and causing smmu fault for other
> descriptor if address hold by one descriptor got cleaned.
> Allocate separate buffer address for each descriptor for TSO path so
> that if one descriptor cleared it should not clean other descriptor
> address.

I think maybe you mean something like:

Currently in the TSO case a page is mapped with dma_map_single(), and then
the resulting dma address is referenced (and offset) by multiple
descriptors until the whole region is programmed into the descriptors.

This makes it possible for stmmac_tx_clean() to dma_unmap() the first of the
already processed descriptors, while the rest are still being processed
by the DMA engine. This leads to an iommu fault due to the DMA engine using
unmapped memory as seen below:

<insert splat>

You can reproduce this easily by <reproduction steps>.

To fix this, let's map each descriptor's memory reference individually.
This way there's no risk of unmapping a region that's still being
referenced by the DMA engine in a later descriptor.

That's a bit nitpicky wording wise, but your first sentence is hard for me to follow (buffer addresses seems to mean descriptor?). I think showing a splat and mentioning how to reproduce is always a bonus as well.

>
> Signed-off-by: Suraj Jaiswal <quic_jsuraj@xxxxxxxxxxx>

Fixes: ?

At a quick glance I think its f748be531d70 ("stmmac: support new GMAC4")

> ---
>
> Changes since v2:
> - Fixed function description
> - Fixed handling of return value.
>

This is v1 as far as netdev is concerned :)

>
> .../net/ethernet/stmicro/stmmac/stmmac_main.c | 63
> ++++++++++++-------
> 1 file changed, 42 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 83b654b7a9fd..5948774c403f 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -4136,16 +4136,18 @@ static bool stmmac_vlan_insert(struct
> stmmac_priv *priv, struct sk_buff *skb,
> /**
> * stmmac_tso_allocator - close entry point of the driver
> * @priv: driver private structure
> - * @des: buffer start address
> + * @addr: Contains either skb frag address or skb->data address
> * @total_len: total length to fill in descriptors
> * @last_segment: condition for the last descriptor
> * @queue: TX queue index
> + * @is_skb_frag: condition to check whether skb data is part of
> + fragment or not
> * Description:
> * This function fills descriptor and request new descriptors according to
> * buffer length to fill
> + * This function returns 0 on success else -ERRNO on fail
> */
> -static void stmmac_tso_allocator(struct stmmac_priv *priv, dma_addr_t des,
> - int total_len, bool last_segment, u32 queue)
> +static int stmmac_tso_allocator(struct stmmac_priv *priv, void *addr,
> + int total_len, bool last_segment, u32
> +queue, bool is_skb_frag)
> {
> struct stmmac_tx_queue *tx_q = &priv->dma_conf.tx_queue[queue];
> struct dma_desc *desc;
> @@ -4153,6 +4155,8 @@ static void stmmac_tso_allocator(struct stmmac_priv *priv, dma_addr_t des,
> int tmp_len;
>
> tmp_len = total_len;
> + unsigned int offset = 0;
> + unsigned char *data = addr;

Reverse xmas tree order, offset is always set below so you could just declare it, and data really doesn't seem necessary to me vs using addr directly.
<Suraj> done.

https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs

>
> while (tmp_len > 0) {
> dma_addr_t curr_addr;
> @@ -4161,20 +4165,44 @@ static void stmmac_tso_allocator(struct stmmac_priv *priv, dma_addr_t des,
> priv->dma_conf.dma_tx_size);
> WARN_ON(tx_q->tx_skbuff[tx_q->cur_tx]);
>
> + buff_size = tmp_len >= TSO_MAX_BUFF_SIZE ?
> + TSO_MAX_BUFF_SIZE : tmp_len;
> +
> if (tx_q->tbs & STMMAC_TBS_AVAIL)
> desc = &tx_q->dma_entx[tx_q->cur_tx].basic;
> else
> desc = &tx_q->dma_tx[tx_q->cur_tx];
>
> - curr_addr = des + (total_len - tmp_len);
> + offset = total_len - tmp_len;
> + if (!is_skb_frag) {
> + curr_addr = dma_map_single(priv->device, data + offset, buff_size,
> + DMA_TO_DEVICE);

Instead of defining "data" above, can't you just use "addr" directly here?

> +
> + if (dma_mapping_error(priv->device, curr_addr))
> + return -ENOMEM;
> +
> + tx_q->tx_skbuff_dma[tx_q->cur_tx].buf = curr_addr;
> + tx_q->tx_skbuff_dma[tx_q->cur_tx].len = buff_size;
> + tx_q->tx_skbuff_dma[tx_q->cur_tx].map_as_page = false;
> + tx_q->tx_skbuff_dma[tx_q->cur_tx].buf_type = STMMAC_TXBUF_T_SKB;
> + } else {
> + curr_addr = skb_frag_dma_map(priv->device, addr, offset,
> + buff_size,
> + DMA_TO_DEVICE);
> +
> + if (dma_mapping_error(priv->device, curr_addr))
> + return -ENOMEM;
> +
> + tx_q->tx_skbuff_dma[tx_q->cur_tx].buf = curr_addr;
> + tx_q->tx_skbuff_dma[tx_q->cur_tx].len = buff_size;
> + tx_q->tx_skbuff_dma[tx_q->cur_tx].map_as_page = true;
> + tx_q->tx_skbuff_dma[tx_q->cur_tx].buf_type = STMMAC_TXBUF_T_SKB;
> + }
> +
> if (priv->dma_cap.addr64 <= 32)
> desc->des0 = cpu_to_le32(curr_addr);
> else
> stmmac_set_desc_addr(priv, desc, curr_addr);
>
> - buff_size = tmp_len >= TSO_MAX_BUFF_SIZE ?
> - TSO_MAX_BUFF_SIZE : tmp_len;
> -
> stmmac_prepare_tso_tx_desc(priv, desc, 0, buff_size,
> 0, 1,
> (last_segment) && (tmp_len <=
> TSO_MAX_BUFF_SIZE), @@ -4182,6 +4210,7 @@ static void
> stmmac_tso_allocator(struct stmmac_priv *priv, dma_addr_t des,
>
> tmp_len -= TSO_MAX_BUFF_SIZE;
> }
> + return 0;

nit: add a newline before return 0

> }
>
> static void stmmac_flush_tx_descriptors(struct stmmac_priv *priv, int
> queue) @@ -4351,25 +4380,17 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
> pay_len = 0;
> }
>
> - stmmac_tso_allocator(priv, des, tmp_pay_len, (nfrags == 0), queue);
> + if (stmmac_tso_allocator(priv, (skb->data + proto_hdr_len),
> + tmp_pay_len, nfrags == 0, queue, false))
> + goto dma_map_err;

Changing the second argument here is subtly changing the dma_cap.addr64 <= 32 case right before this. Is that intentional?

i.e., prior, pretend des = 0 (side note but des is a very confusing variable name for "dma address" when there's also mentions of desc meaning "descriptor" in the DMA ring). In the <= 32 case, we'd call stmmac_tso_allocator(priv, 0) and in the else case we'd call stmmac_tso_allocator(priv, 0 + proto_hdr_len).

With this change in both cases its called with the (not-yet-dma-mapped)
skb->data + proto_hdr_len always (i.e. like the else case).

Honestly, the <= 32 case reads weird to me without this patch. It seems some of the buffer is filled but des is not properly incremented?

I don't know how this hardware is supposed to be programmed (no databook
access) but that seems fishy (and like a separate bug, which would be nice to squash if so in its own patch). Would you be able to explain the logic there to me if it does make sense to you?

<Suraj> des can not be 0 . des 0 means dma_map_single() failed and it will return.
If we see if des calculation (first->des1 = cpu_to_le32(des + proto_hdr_len);) and else case des calculator ( des += proto_hdr_len;) it is adding proto_hdr_len to the memory that we after mapping skb->data using dma_map_single. Same way we added proto_hdr_len in second argument .

>
> /* Prepare fragments */
> for (i = 0; i < nfrags; i++) {
> - const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
> + skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
>
> - des = skb_frag_dma_map(priv->device, frag, 0,
> - skb_frag_size(frag),
> - DMA_TO_DEVICE);
> - if (dma_mapping_error(priv->device, des))
> + if (stmmac_tso_allocator(priv, frag, skb_frag_size(frag),
> + (i == nfrags - 1), queue,
> + true))

Personally I think it would be nice to change stmmac_tso_allocator() so you can keep the frag const above... i.e. something like stmmac_tso_allocator(..., void *addr, ..., const skb_frag_t *frag) and just check if frag is NULL to determine if you're dealing with a frag or not (instead of passing the boolean in to indicate that).

I'm curious if someone else can think of a cleaner API than that for that function, even that's not super pretty...

> goto dma_map_err;
> -
> - stmmac_tso_allocator(priv, des, skb_frag_size(frag),
> - (i == nfrags - 1), queue);
> -
> - tx_q->tx_skbuff_dma[tx_q->cur_tx].buf = des;
> - tx_q->tx_skbuff_dma[tx_q->cur_tx].len = skb_frag_size(frag);
> - tx_q->tx_skbuff_dma[tx_q->cur_tx].map_as_page = true;
> - tx_q->tx_skbuff_dma[tx_q->cur_tx].buf_type = STMMAC_TXBUF_T_SKB;
> }
>
> tx_q->tx_skbuff_dma[tx_q->cur_tx].last_segment = true;
> --
> 2.25.1
>