Re: [net-next] net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout

From: Dipayaan Roy

Date: Fri Feb 27 2026 - 03:08:04 EST

On Tue, Feb 24, 2026 at 08:26:41AM +0000, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
>
> For local reproduction steps see https://netdev-ai.bots.linux.dev/ai-local.html
> ---
> net: mana: Trigger VF reset/recovery on health check failure due to HWC timeout
>
> The GF stats periodic query is used as mechanism to monitor HWC health
> check. If this HWC command times out, it is a strong indication that
> the device/SoC is in a faulty state and requires recovery.
>
> Today, when a timeout is detected, the driver marks
> hwc_timeout_occurred, clears cached stats, and stops rescheduling the
> periodic work. However, the device itself is left in the same failing
> state.
>
> Extend the timeout handling path to trigger the existing MANA VF
> recovery service by queueing a GDMA_EQE_HWC_RESET_REQUEST work item.
> This is expected to initiate the appropriate recovery flow by suspende
> resume first and if it fails then trigger a bus rescan.
>
> This change is intentionally limited to HWC command timeouts and does
> not trigger recovery for errors reported by the SoC as a normal command
> response.
>
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 0055c231acf6..16c438d2aaa3 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
>
> [ ... ]
>
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 9b5a72ada5c4..e6a4034b40f9 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
>
> [ ... ]
>
> > @@ -3530,6 +3530,8 @@ static void mana_gf_stats_work_handler(struct work_struct *work)
> > {
> > struct mana_context *ac =
> > container_of(to_delayed_work(work), struct mana_context, gf_stats_work);
> > + struct gdma_context *gc = ac->gdma_dev->gdma_context;
> > + struct mana_serv_work *mns_wk;
> > int err;
> >
> > err = mana_query_gf_stats(ac);
> > @@ -3537,6 +3539,30 @@ static void mana_gf_stats_work_handler(struct work_struct *work)
> > /* HWC timeout detected - reset stats and stop rescheduling */
> > ac->hwc_timeout_occurred = true;
> > memset(&ac->hc_stats, 0, sizeof(ac->hc_stats));
> > + dev_warn(gc->dev,
> > + "Gf stats wk handler: gf stats query timed out.\n");
> > +
> > + /* As HWC timed out, indicating a faulty HW state and needs a
> > + * reset.
> > + */
> > + if (!test_and_set_bit(GC_IN_SERVICE, &gc->flags)) {
> > + if (!try_module_get(THIS_MODULE)) {
> > + dev_info(gc->dev, "Module is unloading\n");
> > + return;
> > + }
> > +
> > + mns_wk = kzalloc(sizeof(*mns_wk), GFP_ATOMIC);
> > + if (!mns_wk) {
> > + module_put(THIS_MODULE);
> > + return;
> > + }
>
> Does this leave the GC_IN_SERVICE flag permanently set? When
> test_and_set_bit() atomically sets the bit and then try_module_get()
> fails or kzalloc() fails, the function returns without clearing
> GC_IN_SERVICE.
>
> This would permanently block all future recovery attempts. Looking at
> mana_gd_process_eqe() in gdma_main.c:
>
ACK.
> if (test_bit(GC_IN_SERVICE, &gc->flags)) {
> dev_info(gc->dev, "Already in service\n");
> break;
> }
>
> It would detect the set bit and skip the reset. Similarly, mana_tx_timeout()
> in mana_en.c would also short-circuit:
>
> if (test_bit(GC_IN_SERVICE, &gc->flags))
> return;
>
> The device cannot recover without manual intervention since the flag is only
> cleared by mana_serv_reset() at the end of the service work:
>
> drivers/net/ethernet/microsoft/mana/gdma_main.c:mana_serv_reset()
> {
> ...
> out:
> clear_bit(GC_IN_SERVICE, &gc->flags);
> }
>
> But since the work is never scheduled in these error paths, clear_bit() is
> never called.
>
> Note that the existing mana_gd_process_eqe() path avoids this problem by
> calling set_bit(GC_IN_SERVICE) only AFTER all allocations succeed:
>
> drivers/net/ethernet/microsoft/mana/gdma_main.c:mana_gd_process_eqe()
> {
> ...
> if (test_bit(GC_IN_SERVICE, &gc->flags)) {
> dev_info(gc->dev, "Already in service\n");
> break;
> }
>
> if (!try_module_get(THIS_MODULE)) {
> ...
> break;
> }
>
> mns_wk = kzalloc(sizeof(*mns_wk), GFP_KERNEL);
> if (!mns_wk) {
> module_put(THIS_MODULE);
> break;
> }
>
> dev_info(gc->dev, "Start MANA service type:%d\n", type);
> set_bit(GC_IN_SERVICE, &gc->flags); // Set only after success
> ...
> }
>
> Should clear_bit(GC_IN_SERVICE, &gc->flags) be added before each error
> return?
>
ACK.
> > +
> > + mns_wk->pdev = to_pci_dev(gc->dev);
> > + mns_wk->type = GDMA_EQE_HWC_RESET_REQUEST;
> > + pci_dev_get(mns_wk->pdev);
> > + INIT_WORK(&mns_wk->serv_work, mana_serv_func);
> > + schedule_work(&mns_wk->serv_work);
> > + }
>
> This isn't a bug, but the recovery initiation sequence here is a
> near-duplicate of the pattern in mana_gd_process_eqe() (gdma_main.c).
> Both perform the identical 7-step sequence: try_module_get, kzalloc
> mana_serv_work, set pdev, set type, pci_dev_get, INIT_WORK, schedule_work.
>
> The duplication already contributed to the GC_IN_SERVICE error path issue
> above, where this new copy diverged from the original in its error handling.
>
> Consider extracting a common helper to avoid divergence.
>
ACK.
> > return;
> > }
> > schedule_delayed_work(&ac->gf_stats_work, MANA_GF_STATS_PERIOD);
>
> [ ... ]

Thanks Simon, I am addressing these in v2.

Regards