RE: [PATCH v6 09/10] hisi_acc_vfio_pci: Add support for VFIO live migration
From: Shameerali Kolothum Thodi
Date: Wed Mar 02 2022 - 04:07:46 EST
> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@xxxxxxxxxx]
> Sent: 02 March 2022 00:03
> To: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@xxxxxxxxxx>;
> kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> linux-crypto@xxxxxxxxxxxxxxx; cohuck@xxxxxxxxxx; mgurtovoy@xxxxxxxxxx;
> yishaih@xxxxxxxxxx; Linuxarm <linuxarm@xxxxxxxxxx>; liulongfang
> <liulongfang@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>;
> Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>; Wangzhou (B)
> <wangzhou1@xxxxxxxxxxxxx>
> Subject: Re: [PATCH v6 09/10] hisi_acc_vfio_pci: Add support for VFIO live
> migration
>
> On Tue, Mar 01, 2022 at 03:44:31PM -0700, Alex Williamson wrote:
> > On Tue, 1 Mar 2022 16:39:38 -0400
> > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >
> > > On Tue, Mar 01, 2022 at 12:30:47PM -0700, Alex Williamson wrote:
> > > > Wouldn't it make more sense if initial-bytes started at QM_MATCH_SIZE
> > > > and dirty-bytes was always sizeof(vf_data) - QM_MATCH_SIZE? ie.
> QEMU
> > > > would know that it has sizeof(vf_data) - QM_MATCH_SIZE remaining even
> > > > while it's getting ENOMSG after reading QM_MATCH_SIZE bytes of data.
> > >
> > > The purpose of this ioctl is to help userspace guess when moving on to
> > > STOP_COPY is a good idea ie when the device has done almost all the
> > > work it is going to be able to do in PRE_COPY. ENOMSG is a similar
> > > indicator.
> > >
> > > I expect all devices to have some additional STOP_COPY trailer_data in
> > > addition to their PRE_COPY initial_data and dirty_data
> > >
> > > There is a choice to make if we report the trailer_data during
> > > PRE_COPY or not. As this is all estimates, it doesn't matter unless
> > > the trailer_data is very big.
> > >
> > > Having all devices trend toward a 0 dirty_bytes to say they are are
> > > done all the pre-copy they can do makes sense from an API
> > > perspective. If one device trends toward 10MB due to a big
> > > trailer_data and one trends toward 0 bytes, how will qemu consistently
> > > decide when best to trigger STOP_COPY? It makes the API less useful.
> > >
> > > So, I would not include trailer_data in the dirty_bytes.
> >
> > That assumes that it's possible to keep up with the device dirty
> > rate.
>
> It keeps options open so we have this choice someday.
>
> We already see that implementations are using vCPU throttling as part
> of their migration strategy, and we are seriously looking at DMA
> throttling. It is not a big leap to imagine that
> internal-state-dirtying throttling will happne someday.
>
> With throttling iterations would ratchet up the throttle until they
> reach an absolute small amount of dirty then cut over to STOP_COPY
>
> > It seems like a better approach for userspace would be to look at how
> > dirty_bytes is trending.
>
> It may be biw, but this approach doesn't care if the trailing_bytes
> are included or not, so lets leave them out and preserve the other
> operating model.
>
> > If we exclude STOP_COPY trailing data from the VFIO_DEVICE_MIG_PRECOPY
> > ioctl, it seems even more of a disconnect that when we enter the
> > STOP_COPY state, suddenly we start getting new data out of a PRECOPY
> > ioctl.
>
> Why? That amounts can go up at any time, how does it matter if it goes
> up after STOP_COPY or instantly before?
>
> > BTW, "VFIO_DEVICE" should be reserved for ioctls and data structures
> > relative to the device FD, appending it with _MIG is too subtle for me.
> > This is also a GET operation for INFO, so I'd think for consistency
> > with the existing vfio uAPI we'd name this something like
> > VFIO_MIG_GET_PRECOPY_INFO where the structure might be named
> > vfio_precopy_info.
>
> Sure
>
> > So if we don't think this is the right approach for STOP_COPY, then why
> > are we pushing that it has any purpose outside of PRECOPY or might be
> > implemented by a non-PRECOPY driver for use in STOP_COPY?
>
> It is just simpler and more consistent to implement the math under
> this ioctl in all cases then to try and artificially restrict it.
>
> But I don't have a use case for it, so lets block it if you prefer.
>
> Shameerali will you make these adjustments to the PRE_COPY patch?
Sure. I think we can summarize the discussion as below,
- Rename the MIG_PRECOPY ioctl to VFIO_MIG_GET_PRECOPY_INFO and
structure to vfio_precopy_info.
- This ioctl is only valid in PRE_COPY state and should return -EINVAL in
other states(Update the documentation).
- No changes to the initial_bytes & dirty_bytes descriptions.
Please let me know if I missed anything.
I will address other comments on this series as well and sent out a
revised one soon.
Thanks,
Shameer