RE: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload when host is unresponsive

From: Haiyang Zhang
Date: Tue Jul 04 2023 - 09:42:40 EST




> -----Original Message-----
> From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxx>
> Sent: Monday, July 3, 2023 3:55 PM
> To: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>; souradeep chakrabarti
> <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> Cc: KY Srinivasan <kys@xxxxxxxxxxxxx>; Haiyang Zhang
> <haiyangz@xxxxxxxxxxxxx>; wei.liu@xxxxxxxxxx; Dexuan Cui
> <decui@xxxxxxxxxxxxx>; davem@xxxxxxxxxxxxx; edumazet@xxxxxxxxxx;
> kuba@xxxxxxxxxx; pabeni@xxxxxxxxxx; Long Li <longli@xxxxxxxxxxxxx>; Ajay
> Sharma <sharmaajay@xxxxxxxxxxxxx>; leon@xxxxxxxxxx;
> cai.huoqing@xxxxxxxxx; ssengar@xxxxxxxxxxxxxxxxxxx; vkuznets@xxxxxxxxxx;
> tglx@xxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx;
> stable@xxxxxxxxxxxxxxx
> Subject: RE: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload
> when host is unresponsive
>
>
>
> >-----Original Message-----
> >From: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
> >Sent: Monday, July 3, 2023 10:18 PM
> >To: souradeep chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> >Cc: KY Srinivasan <kys@xxxxxxxxxxxxx>; Haiyang Zhang
> ><haiyangz@xxxxxxxxxxxxx>; wei.liu@xxxxxxxxxx; Dexuan Cui
> ><decui@xxxxxxxxxxxxx>; davem@xxxxxxxxxxxxx; edumazet@xxxxxxxxxx;
> >kuba@xxxxxxxxxx; pabeni@xxxxxxxxxx; Long Li <longli@xxxxxxxxxxxxx>; Ajay
> >Sharma <sharmaajay@xxxxxxxxxxxxx>; leon@xxxxxxxxxx;
> >cai.huoqing@xxxxxxxxx; ssengar@xxxxxxxxxxxxxxxxxxx; vkuznets@xxxxxxxxxx;
> >tglx@xxxxxxxxxxxxx; linux-hyperv@xxxxxxxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> >linux-kernel@xxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx;
> >stable@xxxxxxxxxxxxxxx; Souradeep Chakrabarti
> <schakrabarti@xxxxxxxxxxxxx>
> >Subject: [EXTERNAL] Re: [PATCH V4 net] net: mana: Fix MANA VF unload when
> >host is unresponsive
> >
> >From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> >Date: Mon, 3 Jul 2023 01:49:31 -0700
> >
> >> From: Souradeep Chakrabarti <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> >
> >Please sync your Git name and Git mail account settings, so that your own
> >patches won't have "From:" when sending. From what I see, you need to
> >correct first letters of name and surname to capital in the Git email settings
> >block.
> Thank you for pointing, I will fix it.
> >
> >>
> >> When unloading the MANA driver, mana_dealloc_queues() waits for the
> >> MANA hardware to complete any inflight packets and set the pending
> >> send count to zero. But if the hardware has failed,
> >> mana_dealloc_queues() could wait forever.
> >>
> >> Fix this by adding a timeout to the wait. Set the timeout to 120
> >> seconds, which is a somewhat arbitrary value that is more than long
> >> enough for functional hardware to complete any sends.
> >>
> >> Signed-off-by: Souradeep Chakrabarti
> >> <schakrabarti@xxxxxxxxxxxxxxxxxxx>
> >
> >Where's "Fixes:" tagging the blamed commit?
> This is present from the day zero of the mana driver code.
> It has not been introduced in the code by any commit.
> >
> >> ---
> >> V3 -> V4:
> >> * Fixed the commit message to describe the context.
> >> * Removed the vf_unload_timeout, as it is not required.
> >> ---
> >> drivers/net/ethernet/microsoft/mana/mana_en.c | 26
> >> ++++++++++++++++---
> >> 1 file changed, 23 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> >> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> >> index a499e460594b..d26f1da70411 100644
> >> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> >> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> >> @@ -2346,7 +2346,10 @@ static int mana_dealloc_queues(struct
> >> net_device *ndev) {
> >> struct mana_port_context *apc = netdev_priv(ndev);
> >> struct gdma_dev *gd = apc->ac->gdma_dev;
> >> + unsigned long timeout;
> >> struct mana_txq *txq;
> >> + struct sk_buff *skb;
> >> + struct mana_cq *cq;
> >> int i, err;
> >>
> >> if (apc->port_is_up)
> >> @@ -2363,15 +2366,32 @@ static int mana_dealloc_queues(struct
> >net_device *ndev)
> >> * to false, but it doesn't matter since mana_start_xmit() drops any
> >> * new packets due to apc->port_is_up being false.
> >> *
> >> - * Drain all the in-flight TX packets
> >> + * Drain all the in-flight TX packets.
> >> + * A timeout of 120 seconds for all the queues is used.
> >> + * This will break the while loop when h/w is not responding.
> >> + * This value of 120 has been decided here considering max
> >> + * number of queues.
> >> */
> >> +
> >> + timeout = jiffies + 120 * HZ;
> >
> >Why not initialize it right when declaring?
> I will fix it in next version.
> >
> >> for (i = 0; i < apc->num_queues; i++) {
> >> txq = &apc->tx_qp[i].txq;
> >> -
> >> - while (atomic_read(&txq->pending_sends) > 0)
> >> + while (atomic_read(&txq->pending_sends) > 0 &&
> >> + time_before(jiffies, timeout)) {
> >> usleep_range(1000, 2000);> + }
> >> }
> >
> >120 seconds by 2 msec step is 60000 iterations, by 1 msec is 120000
> >iterations. I know usleep_range() often is much less precise, but still.
> >Do you really need that much time? Has this been measured during the tests
> >that it can take up to 120 seconds or is it just some random value that "should
> >be enough"?
> >If you really need 120 seconds, I'd suggest using a timer / delayed work
> instead
> >of wasting resources.
> Here the intent is not waiting for 120 seconds, rather than avoid continue
> checking the
> pending_sends of each tx queues for an indefinite time, before freeing
> sk_buffs.
> The pending_sends can only get decreased for a tx queue, if mana_poll_tx_cq()
> gets called for a completion notification and increased by xmit.
>
> In this particular bug, apc->port_is_up is not set to false, causing
> xmit to keep increasing the pending_sends for the queue and
> mana_poll_tx_cq()
> not getting called for the queue.
>
> If we see the comment in the function mana_dealloc_queues(), it mentions it :
>
> 2346 /* No packet can be transmitted now since apc->port_is_up is false.
> 2347 * There is still a tiny chance that mana_poll_tx_cq() can re-enable
> 2348 * a txq because it may not timely see apc->port_is_up being cleared
> 2349 * to false, but it doesn't matter since mana_start_xmit() drops any
> 2350 * new packets due to apc->port_is_up being false.
>
> The value 120 seconds has been decided here based on maximum number of
> queues
> are allowed in this specific hardware, it is a safe assumption.

I agree. Also, this waiting time is usually much shorter than 120 sec. The long
wait only happens in rare and unexpected NIC HW non-responding cases. To
further reduce the resource consumption, we can double the usleep_range()
time in every iteration. So, the number of iterations will be greatly reduced
before reaching 120 sec.

Thanks,
- Haiyang