Re: [PATCH net] net: ethernet: ti: am65-cpsw: fix freeing IRQ in am65_cpsw_nuss_remove_tx_chns()

From: Roger Quadros
Date: Wed Jan 15 2025 - 11:39:25 EST


Siddharth,

On 15/01/2025 17:49, Roger Quadros wrote:
> Hi Siddharth,
>
> On 15/01/2025 12:38, Siddharth Vadapalli wrote:
>> On Wed, Jan 15, 2025 at 12:04:17PM +0200, Roger Quadros wrote:
>>> Hi Siddharth,
>>>
>>> On 15/01/2025 07:18, Siddharth Vadapalli wrote:
>>>> On Tue, Jan 14, 2025 at 06:44:02PM +0200, Roger Quadros wrote:
>>>>
>>>> Hello Roger,
>>>>
>>>>> When getting the IRQ we use k3_udma_glue_rx_get_irq() which returns
>>>>
>>>> You probably meant "k3_udma_glue_tx_get_irq()" instead? It is used to
>>>> assign tx_chn->irq within am65_cpsw_nuss_init_tx_chns() as follows:
>>>
>>> Yes I meant tx instead of rx.
>>>
>>>>
>>>> tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
>>>>
>>>> Additionally, following the above section we have:
>>>>
>>>> if (tx_chn->irq < 0) {
>>>> dev_err(dev, "Failed to get tx dma irq %d\n",
>>>> tx_chn->irq);
>>>> ret = tx_chn->irq;
>>>> goto err;
>>>> }
>>>>
>>>> Could you please provide details on the code-path which will lead to a
>>>> negative "tx_chn->irq" within "am65_cpsw_nuss_remove_tx_chns()"?
>>>>
>>>> There seem to be two callers of am65_cpsw_nuss_remove_tx_chns(), namely:
>>>> 1. am65_cpsw_nuss_update_tx_rx_chns()
>>>> 2. am65_cpsw_nuss_suspend()
>>>> Since both of them seem to invoke am65_cpsw_nuss_remove_tx_chns() only
>>>> in the case where am65_cpsw_nuss_init_tx_chns() *did not* error out, it
>>>> appears to me that "tx_chn->irq" will never be negative within
>>>> am65_cpsw_nuss_remove_tx_chns()
>>>>
>>>> Please let me know if I have overlooked something.
>>>
>>> The issue is with am65_cpsw_nuss_update_tx_rx_chns(). It can be called
>>> repeatedly (by user changing number of TX queues) even if previous call
>>> to am65_cpsw_nuss_init_tx_chns() failed.
>>
>> Thank you for clarifying. So the issue/bug was discovered since the
>> implementation of am65_cpsw_nuss_update_tx_rx_chns(). The "Fixes" tag
>> misled me. Maybe the "Fixes" tag should be updated? Though we should
>> code to future-proof it as done in this patch, the "Fixes" tag pointing
>> to the very first commit of the driver might not be accurate as the
>> code-path associated with the bug cannot be exercised at that commit.
>
> Fair enough. I'll change the Fixes commit.

Now that I check the code again, am65_cpsw_nuss_remove_tx_chns(),
am65_cpsw_nuss_update_tx_chns() and am65_cpsw_nuss_init_tx_chns()
were all introduced in the Fixes commit I stated.

Could you please share why you thought it is not accurate?

>
>>
>> Independent of the above change suggested for the "Fixes" tag,
>>
>> Reviewed-by: Siddharth Vadapalli <s-vadapalli@xxxxxx>
>>
>> There seems to be a different bug in am65_cpsw_nuss_update_tx_rx_chns()
>> which I have described below.
>>
>>>
>>> Please try the below patch to simulate the error condition.
>>>
>>> Then do the following
>>> - bring down all network interfaces
>>> - change num TX queues to 2. IRQ for 2nd channel fails.
>>> - change num TX queues to 3. Now we try to free an invalid IRQ and we see warning.
>>>
>>> Also I think it is good practice to code for set value than to code
>>> for existing code paths as they can change in the future.
>>>
>>> --test patch starts--
>>>
>>> diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
>>> index 36c29d3db329..c22cadaaf3d3 100644
>>> --- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
>>> +++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
>>> @@ -155,7 +155,7 @@
>>> NETIF_MSG_IFUP | NETIF_MSG_PROBE | NETIF_MSG_IFDOWN | \
>>> NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
>>>
>>> -#define AM65_CPSW_DEFAULT_TX_CHNS 8
>>> +#define AM65_CPSW_DEFAULT_TX_CHNS 1
>>> #define AM65_CPSW_DEFAULT_RX_CHN_FLOWS 1
>>>
>>> /* CPPI streaming packet interface */
>>> @@ -2346,7 +2348,10 @@ static int am65_cpsw_nuss_init_tx_chns(struct am65_cpsw_common *common)
>>> tx_chn->dsize_log2 = __fls(hdesc_size_out);
>>> WARN_ON(hdesc_size_out != (1 << tx_chn->dsize_log2));
>>>
>>> - tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
>>> + if (i == 1)
>>> + tx_chn->irq = -ENODEV;
>>> + else
>>> + tx_chn->irq = k3_udma_glue_tx_get_irq(tx_chn->tx_chn);
>>
>> The pair - am65_cpsw_nuss_init_tx_chns()/am65_cpsw_nuss_remove_tx_chns()
>> seem to be written under the assumption that failure will result in the
>> driver's probe failing.
>>
>> With am65_cpsw_nuss_update_tx_rx_chns(), that assumption no longer holds
>> true. Please consider the following sequence:
>>
>> 1.
>> am65_cpsw_nuss_probe()
>> am65_cpsw_nuss_register_ndevs()
>> am65_cpsw_nuss_init_tx_chns() => Succeeds
>>
>> 2.
>> Probe is successful
>>
>> 3.
>> am65_cpsw_nuss_update_tx_rx_chns() => Invoked by user
>> am65_cpsw_nuss_remove_tx_chns() => Succeeds
>> am65_cpsw_nuss_init_tx_chns() => Partially fails
>> devm_add_action(dev, am65_cpsw_nuss_free_tx_chns, common);
>> ^ DEVM Action is added, but since the driver isn't removed,
>> the cleanup via am65_cpsw_nuss_free_tx_chns() will not run.
>>
>> Only when the user re-invokes am65_cpsw_nuss_update_tx_rx_chns(),
>> the cleanup will be performed. This might have to be fixed in the
>> following manner:
>>
>> @@ -3416,10 +3416,17 @@ int am65_cpsw_nuss_update_tx_rx_chns(struct am65_cpsw_common *common,
>> common->tx_ch_num = num_tx;
>> common->rx_ch_num_flows = num_rx;
>> ret = am65_cpsw_nuss_init_tx_chns(common);
>> - if (ret)
>> + if (ret) {
>> + devm_remove_action(dev, am65_cpsw_nuss_free_tx_chns, common);
>> + am65_cpsw_nuss_free_tx_chns(common);
>> return ret;
>> + }
>>
>> ret = am65_cpsw_nuss_init_rx_chns(common);
>> + if (ret) {
>> + devm_remove_action(dev, am65_cpsw_nuss_free_rx_chns, common);
>> + am65_cpsw_nuss_free_rx_chns(common);
>> + }
>>
>> return ret;
>> }
>>
>> Please let me know what you think.
>
> I've already implemented a cleanup series to get rid of devm_add/remove_action,
> cleanup probe error path and streamline TX and RQ queue init/cleanup.
> I'll send out the series soon as soon as I finish some tests.
>

--
cheers,
-roger