Re: [PATCH v3 6/6] hisi_acc_vfio_pci: Add support for VFIO live migration

From: Leon Romanovsky
Date: Mon Sep 27 2021 - 14:17:37 EST


On Mon, Sep 27, 2021 at 01:06:27PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 27, 2021 at 07:00:23PM +0300, Leon Romanovsky wrote:
> > On Mon, Sep 27, 2021 at 12:01:19PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 27, 2021 at 01:46:31PM +0000, Shameerali Kolothum Thodi wrote:
> > >
> > > > > > > Nope, this is locked wrong and has no lifetime management.
> > > > > >
> > > > > > > Ok. Is holding the device_lock() sufficient here?
> > > > >
> > > > > > You can't hold a hisi_qm pointer without some kind of lifecycle
> > > > > management of that pointer. device_lock/etc is necessary to call
> > > > > pci_get_drvdata()
> > > >
> > > > Since this migration driver only supports VF devices and the PF
> > > > > driver will not be removed until all the VF devices get removed,
> > > > is the locking necessary here?
> > >
> > > Oh.. That is really busted up. pci_sriov_disable() is called under the
> > > device_lock(pf) and obtains the device_lock(vf).
> >
> > Yes, indirectly, but yes.
> >
> > >
> > > This means a VF driver can never use the device_lock(pf), otherwise it
> > > can deadlock itself if PF removal triggers VF removal.
> >
> > VF can use pci_dev_trylock() on PF to prevent PF removal.
>
> no, not here, the device_lock is used in too many places for a trylock
> to be appropriate
>
> > >
> > > But you can't access these members without using the device_lock(), as
> > > there really are no safety guarantees..
> > >
> > > The mlx5 patches have this same sketchy problem.
> > >
> > > We may need a new special function 'pci_get_sriov_pf_devdata()' that
> > > confirms the vf/pf relationship and explicitly interlocks with the
> > > pci_sriov_enable/disable instead of using device_lock()
> > >
> > > Leon, what do you think?
> >
> > I see pci_dev_lock() and similar functions, they are easier to
> > understand than a specific pci_get_sriov_pf_devdata().
>
> That is just a wrapper for device_lock - it doesn't help anything
>
> The point is to call out a different locking regime that relies on the
> sriov enable/disable removing the VF struct devices

You can't avoid trylock, because this pci_get_sriov_pf_devdata() will be
called from the VF, which already holds its own lock, so a blocking
attempt to take the PF lock there can deadlock.

The PCI code assumes that the PF lock is taken first and the VF lock second.
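
To make the ordering argument concrete, here is a rough userspace sketch,
not kernel code. It uses pthread mutexes as stand-ins for device_lock(), and
struct fake_dev, vf_try_get_pf() and vf_put_pf() are hypothetical names
invented for illustration only. It assumes the caller is on the VF side and
already holds the VF lock, so the PF lock may only be tried, never waited on.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a struct device with its lock and drvdata. */
struct fake_dev {
	pthread_mutex_t lock;	/* models device_lock(dev) */
	void *drvdata;		/* models what pci_get_drvdata() would return */
};

/*
 * Hypothetical helper, meant to be called with the VF lock already held.
 *
 * The established order is PF lock first, VF lock second. Blocking on the
 * PF lock from here would invert that order, so we only try it and report
 * failure instead of waiting.
 *
 * Returns true with pf->lock held and *out set; the caller must then call
 * vf_put_pf(). Returns false without touching *out if the PF is busy.
 */
static bool vf_try_get_pf(struct fake_dev *pf, void **out)
{
	assert(pf != NULL && out != NULL);

	if (pthread_mutex_trylock(&pf->lock) != 0)
		return false;	/* PF lock is held elsewhere, back off */

	*out = pf->drvdata;
	return true;
}

/* Drop the PF lock taken by a successful vf_try_get_pf(). */
static void vf_put_pf(struct fake_dev *pf)
{
	pthread_mutex_unlock(&pf->lock);
}
```

A VF-side caller would use the returned pointer only between a successful
vf_try_get_pf() and the matching vf_put_pf(), and would have to handle the
failure case (retry later or return an error) rather than sleep on the PF.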

Thanks

>
> Jason