RE: [REGRESSION] mlx5: Driver remove during hot unplug is broken

From: Parav Pandit
Date: Wed Jul 08 2020 - 06:43:50 EST


Hi Niklas,

> From: Niklas Schnelle <schnelle@xxxxxxxxxxxxx>
> Sent: Monday, June 15, 2020 3:32 PM
>
> Hello Saeed,
>
> On 6/13/20 12:01 AM, Saeed Mahameed wrote:
> > On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
> >> Hello Parav, Hello Saeed,
> >>
> ... snip ...
> >>
> >> So without really knowing anything about these functions I would
> >> guess that with the device still registered the drained queue does
> >> not remain empty as new entries are added.
> >> Does that sound plausible to you?
> >>
> >
> > I don't think it is related, maybe this is similar to some issues
> > addressed lately by Shay's patches:
> >
> >
> https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.23501
> > 4-2-saeedm@xxxxxxxxxxxx/
> >
> https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.23501
> > 4-3-saeedm@xxxxxxxxxxxx/
> >
> > net/mlx5: drain health workqueue in case of driver load error
> > net/mlx5: Fix fatal error handling during device load
>
> I agree with your similarity assessment especially for the first commit.
> These do not fix the issue though, with mainline v5.8-rc1 which has both I'm
> still getting a hang over 50% of the time with the following detach sequence
> on z/VM:
>
> vmcp detach pcif <mlx_fid>; echo 1 > /proc/cio_settle
>
> Since now the commit 41798df9bfca ("net/mlx5: Drain wq first during PCI
> device removal") no longer reverts cleanly I used the following diff to move
> the mlx5_drain_health_wq(dev) after the mlx5_unregister_devices(dev).
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 8b658908f044..63a196fd8e68 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -1382,8 +1382,8 @@ static void remove_one(struct pci_dev *pdev)
>
> devlink_reload_disable(devlink);
> mlx5_crdump_disable(dev);
> - mlx5_drain_health_wq(dev);
> mlx5_unload_one(dev, true);
> + mlx5_drain_health_wq(dev);
> mlx5_pci_close(dev);
> mlx5_mdev_uninit(dev);
> mlx5_devlink_free(devlink);
>
>
> Note that this changed order also matches the call order in
> mlx5_pci_err_detected().
> With that change I've now done over two dozen detachments with varying
> time between attach and detach to have the driver at different stages of
> initialization.
> With the change all worked without a hitch.
>
> Best regards,
> Niklas Schnelle
> >

Sorry for my late response.
Yes, this looks good and I also found same in my analysis.
With latest code mlx5_pci_close() already does drain_health_wq(), so the additional call in remove_one() is redundant.
It should be just removed.
If you can verify below hunk in your setup, it will be really helpful.
You still need patch 42ea9f1b5c6 in your tree.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8b658908f044..ebec2318dbc4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1382,7 +1382,6 @@ static void remove_one(struct pci_dev *pdev)

devlink_reload_disable(devlink);
mlx5_crdump_disable(dev);
- mlx5_drain_health_wq(dev);
mlx5_unload_one(dev, true);
mlx5_pci_close(dev);
mlx5_mdev_uninit(dev);