Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION

From: Yan Zhao
Date: Mon Jun 01 2020 - 03:07:28 EST

On Fri, May 29, 2020 at 03:45:47PM -0600, Alex Williamson wrote:
> On Sun, 17 May 2020 22:52:45 -0400
> Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > This is a virtual irq type.
> > vendor driver triggers this irq when it wants to notify userspace to
> > remap PCI BARs.
> >
> > 1. vendor driver triggers this irq and packs the target bar number in
> > the ctx count. i.e. "1 << bar_number".
> > if a bit is set, the corresponding bar is to be remapped.
> >
> > 2. userspace requery the specified PCI BAR from kernel and if flags of
> > the bar regions are changed, it removes the old subregions and attaches
> > subregions according to the new flags.
> >
> > 3. userspace notifies back to kernel by writing one to the eventfd of
> > this irq.
> >
> > Please check the corresponding qemu implementation from the reply of this
> > patch, and a sample usage in vendor driver in patch [10/10].
> >
> > Cc: Kevin Tian <kevin.tian@xxxxxxxxx>
> > Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> > ---
> > include/uapi/linux/vfio.h | 11 +++++++++++
> > 1 file changed, 11 insertions(+)
> >
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 2d0d85c7c4d4..55895f75d720 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -704,6 +704,17 @@ struct vfio_irq_info_cap_type {
> > __u32 subtype; /* type specific */
> > };
> >
> > +/* Bar Region Query IRQ TYPE */
> > +
> > +/* sub-types for VFIO_IRQ_TYPE_REMAP_BAR_REGION */
> > +/*
> > + * This irq notifies userspace to re-query BAR region and remaps the
> > + * subregions.
> > + */
> Hi Yan,
> How do we do this in a way that's backwards compatible? Or maybe, how
> do we perform a handshake between the vendor driver and userspace to
> indicate this support?
hi Alex
thank you for your thoughtful review!

do you think below sequence can provide enough backwards compatibility?

- on vendor driver opening, it registers an irq of type
VFIO_IRQ_TYPE_REMAP_BAR_REGION, and reports to driver vfio-pci there's
1 vendor irq.

- after userspace detects the irq of type VFIO_IRQ_TYPE_REMAP_BAR_REGION
it enables it by signaling ACTION_TRIGGER.

- on receiving this ACTION_TRIGGER, vendor driver will try to setup a
virqfd to monitor file write to the fd of this irq, enable this irq
and return its enabling status to userspace.

> Would the vendor driver refuse to change
> device_state in the migration region if the user has not enabled this
> IRQ?
yes, vendor driver can refuse to change device_state if the irq
in my sample i40e_vf driver (patch 10/10), it implemented this logic
like below:

| ret = i40e_vf_prepare_dirty_track(i40e_vf_dev);
|->ret = i40e_vf_remap_bars(i40e_vf_dev, true);
|->if (!i40e_vf_dev->remap_irq_ctx.init)
return -ENODEV;

(i40e_vf_dev->remap_irq_ctx.init is set in below path)
|-> vf_dev->remap_irq_ctx.init = true;

> Everything you've described in the commit log needs to be in this
> header, we can't have the usage protocol buried in a commit log. It
got it! I'll move all descriptions in commit logs to this header so that
readers can understand the whole picture here.

> also seems like this is unnecessarily PCI specific. Can't the count
> bitmap simply indicate the region index to re-evaluate? Maybe you were
yes, it is possible. but what prevented me from doing it is that it's not
easy to write an irq handler in qemu to remap other regions dynamically.

for BAR regions, there're 3 layers as below.
1. bar->mr -->bottom layer
2. bar->region.mem --> slow path
3. bar->region->mmaps[i].mem --> fast path
so, bar remap irq handler can simply re-revaluate the region and
remove/re-generate the layer 3 (fast path) without losing track of any
guest accesses to the bar regions.

actually so far, the bar remap irq handler in qemu only supports remap
mmap'd subregions (layout of mmap'd subregions are re-queried) and
not supports updating the whole bar region size.
(do you think updating bar region size is a must?)

however, there are no such fast path and slow path in other regions, so
remap handlers for them are region specific.

> worried about running out of bits in the ctx count? An IRQ per region
yes. that's also possible :)
but current ctx count is 64bit, so it can support regions of index up to 63.
if we don't need to remap dev regions, seems it's enough?

> could resolve that, but maybe we could also just add another IRQ for
> the next bitmap of regions. I assume that the bitmap can indicate
> multiple regions to re-evaluate, but that should be documented.
hmm. would you mind elaborating more about it?

> Also, what sort of service requirements does this imply? Would the
> vendor driver send this IRQ when the user tries to set the device_state
> to _SAVING and therefore we'd require the user to accept, implement the
> mapping change, and acknowledge the IRQ all while waiting for the write
> to device_state to return? That implies quite a lot of asynchronous
> support in the userspace driver. Thanks,
(1) when user sets device_state to _SAVING, the vendor driver notifies this
IRQ, waits until user IRQ ack is received.
(2) in IRQ handler, user decodes and sends IRQ ack to vendor driver.

if a wait is required in (1) returns, it demands the qemu_mutex_iothread is
not locked in migration thread when device_state is set in (1), as before
entering (2), acquiring of this mutex is required.

Currently, this lock is not hold in vfio_migration_set_state() at
save_setup stage but is hold in stop and copy stage. so we wait in
kernel in save_setup stage and not wait in stop stage.
it can be fixed by calling qemu_mutex_unlock_iothread() on entering
vfio_migration_set_state() and qemu_mutex_lock_iothread() on leaving
vfio_migration_set_state() in qemu.

do you think it's acceptable?

> > +
> > +
> > /**
> > * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
> > *