RE: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver

From: Liu, Yi L
Date: Fri Jul 12 2019 - 08:55:38 EST


Hi Alex,

> From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> Sent: Friday, July 12, 2019 3:08 AM
> To: Liu, Yi L <yi.l.liu@xxxxxxxxx>
> Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
>
> On Thu, 11 Jul 2019 12:27:26 +0000
> "Liu, Yi L" <yi.l.liu@xxxxxxxxx> wrote:
>
> > Hi Alex,
> >
> > > From: kvm-owner@xxxxxxxxxxxxxxx [mailto:kvm-owner@xxxxxxxxxxxxxxx] On
> Behalf
> > > Of Alex Williamson
> > > Sent: Friday, July 5, 2019 11:55 PM
> > > To: Liu, Yi L <yi.l.liu@xxxxxxxxx>
> > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > >
> > > On Thu, 4 Jul 2019 09:11:02 +0000
> > > "Liu, Yi L" <yi.l.liu@xxxxxxxxx> wrote:
> > >
> > > > Hi Alex,
> > > >
> > > > > From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> > > > > Sent: Thursday, July 4, 2019 1:22 AM
> > > > > To: Liu, Yi L <yi.l.liu@xxxxxxxxx>
> > > > > Subject: Re: [PATCH v1 9/9] smaples: add vfio-mdev-pci driver
> > [...]
> > > >
> > > > > It's really unfortunate that we don't have the mdev inheriting the
> > > > > iommu group of the iommu_device so that userspace can really understand
> > > > > this relationship. A separate group makes sense for the aux-domain
> > > > > case, and is (I guess) not a significant issue in the case of a
> > > > > singleton iommu_device group, but it's pretty awkward here. Perhaps
> > > > > this is something we should correct in design of iommu backed mdevs.
> > > >
> > > > Yeah, for aux-domain case, it is not significant issue as aux-domain essentially
> > > > means singleton iommu_devie group. And in early time, when designing the
> > > support
> > > > for wrap pci as a mdev, we also considered to let vfio-mdev-pci to reuse
> > > > iommu_device group. But this results in an iommu backed group includes mdev
> and
> > > > physical devices, which might also be strange. Do you think it is valuable to
> > > reconsider
> > > > it?
> > >
> > > From a group perspective, the cleanest solution would seem to be that
> > > IOMMU backed mdevs w/o aux domain support should inherit the IOMMU
> > > group of the iommu_device,
> >
> > A confirm here. Regards to inherit the IOMMU group of iommu_device, do
> > you mean mdev device should be added to the IOMMU group of iommu_device
> > or maintain a parent and inheritor relationship within vfio? I guess you mean the
> > later one? :-)
>
> I was thinking the former, I'm not sure what the latter implies. There
> is no hierarchy within or between IOMMU groups, it's simply a set of
> devices.

I have a concern on adding the mdev device to the iommu_group of
iommu_device. In such configuration, a iommu backed group includes
mdev devices and physical devices. Then it might be necessary to advertise
the mdev info to the in-kernel software which want to loop all devices within
such an iommu_group. An example I can see is the virtual SVA threads in
community. e.g. for a guest pasid bind, the changes below loops all the
devices within an iommu_group, and each loop will call into vendor iommu
driver with a device structure passed in. It is quite possible that vendor
iommu driver need to get something behind a physical device (e.g.
intel_iommu structure). For a physical device, it is fine. While for mdev
device, it would be a problem if no mdev info advertised to iommu driver. :-(
Although we have agreement that PASID support should be disabled for
devices which are from non-singleton group. But I don't feel like to rely on
such assumptions when designing software flows. Also, it's just an example,
we have no idea if there will be more similar flows which require to loop all
devices in an iommu group in future. May be we want to avoid adding a mdev
to an iommu backed group. :-) More replies to you response below.

+static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
+ void __user *arg,
+ struct vfio_iommu_type1_bind *bind)
+ ...
+ list_for_each_entry(domain, &iommu->domain_list, next) {
+ list_for_each_entry(group, &domain->group_list, next) {
+ ret = iommu_group_for_each_dev(group->iommu_group,
+ &guest_bind, vfio_bind_gpasid_fn);
+ if (ret)
+ goto out_unbind;
+ }
+ }
+ ...
+}

> Maybe what you're getting at is that vfio needs to understand
> that the mdev is a child of the endpoint device in its determination of
> whether the group is viable.

Is the group here the group of iommu_device or a group of a mdev device?
:-) Actually, I think the group of a mdev device is always viable since
it has only a device and mdev_driver will add the mdev device to vfio
controlled scope to make the mdev group viable. Per my understanding,
VFIO guarantees the isolation by two major arts. First is checking if
group is viable before adding it to a container, second is preventing
multiple opens to /dev/vfio/group_id by the vfio_group->opened field
maintained in vfio.c.

Back to the configuration we are talking here (For example a group where
one devices is bound to a native host driver and the other device bound
to a vfio driver[1].), we have two groups( iommu backed one and mdev group).
I think for iommu_device which wants to "donate" its iommu_group, the
host driver should explicitly call vfio_add_group_dev() to add itself
to the vfio controlled scope. And thus make its iommu backed group be
viable. So that we can have two viable iommu groups. iommu backed group
is viable by the host driver's vfio_add_group_dev() calling, and mdev
group is naturally viable. Until now, we can passthru the devices
(vfio-pci device and a mdev device) under this configuration to VM well.
But we cannot prevent user to passthru the devices to different VMs since
the two iommu groups are both viable. If I'm still understanding vfio
correct until this line, I think we need to fail the attempt of passthru
to multiple VMs in vfio_iommu_type1_attach_group() by checking the
vfio_group->opened field which is maintained in vfio.c. e.g. let's say
for iommu backed group, we have vfio_group#1 and mdev group, we have
vfio_group#2 in vfio.c, then opening vfio_group#1 requires to inc the
vfio_group#2->opened. And vice versa.

[1] the example from the previous reply of you.

> That's true, but we can also have IOMMU
> groups composed of SR-IOV VFs along with their parent PF if the root of
> the IOMMU group is (for example) a downstream switch port above the PF.
> So we can't simply look at the parent/child relationship within the
> group, we somehow need to know that the parent device sharing the IOMMU
> group is operating in host kernel space on behalf of the mdev.

I think for such hardware configuration, we still have only two iommu
group, a iommu backed one and a mdev group. May the idea above still
applicable. :-)

> > > but I think the barrier here is that we have
> > > a difficult time determining if the group is "viable" in that case.
> > > For example a group where one devices is bound to a native host driver
> > > and the other device bound to a vfio driver would typically be
> > > considered non-viable as it breaks the isolation guarantees. However
> >
> > yes, this is how vfio guarantee the isolation before allowing user to further
> > add a group to a vfio container and so on.
> >
> > > I think in this configuration, the parent device is effectively
> > > participating in the isolation and "donating" its iommu group on behalf
> > > of the mdev device. I don't think we can simultaneously use that iommu
> > > group for any other purpose.
> >
> > Agree. At least host cannot make use of the iommu group any more in such
> > configuration.
> >
> > > I'm sure we could come up with a way for
> > > vifo-core to understand this relationship and add it to the white list,
> >
> > The configuration is host driver still exists while we want to let mdev device
> > to somehow "own" the iommu backed DMA isolation capability. So one possible
> > way may be calling vfio_add_group_dev() which will creates a vfio_device instance
> > for the iommu_device in vfio.c when creating a iommu backed mdev. Then the
> > iommu group is fairly viable.
>
> "fairly viable" ;) It's a correct use of the term, it's a little funny
> though as "fairly" can also mean reasonably/sufficiently/adequately as
> well as I think the intended use here equivalent to justly. </tangent>

Aha, a nice "lesson" for me. Honestly, I have no idea how it came to me
when trying to describe my idea with a moderate term either. Luckily,
it made me well understood. :-)

> That's an interesting idea to do an implicit vfio_add_group_dev() on
> the iommu_device in this case, if you've worked through how that could
> play out, it'd be interesting to see.

I've tried it in my vfio-mdev-pci driver probe() phase, it works well.
And this is an explicit calling. And I guess we may really want host driver
to do it explicitly instead of implicitly as host driver owns the choice
of whether "donating" group or not. While for failing the
vfio_iommu_type1_attach_group() to prevent user passthru the vfio-pci device
and vfio-mdev-pci device (share iommu backed group) to different VMs, I'm
doing some changes. If it's a correct way, I'll try to send out a new version
for your further review. :-)

> > > I wonder though how confusing this might be to users who now understand
> > > the group/driver requirement to be "all endpoints bound to vfio
> > > drivers". This might still be the best approach regardless of this.
> >
> > Yes, another thing I'm considering is how to prevent such a host driver from
> > issuing DMA. If we finally get a device bound to vfio-pci and another device
> > wrapped as mdev and passthru them to VM, the host driver is still capable to
> > issue DMA. Though IOMMU can block some DMAs, but not all of them. If a
> > DMA issued by host driver happens to have mapping in IOMMU side, then
> > host is kind of doing things on behalf on VM. Though we may trust the host
> > driver, but it looks to be a little bit awkward to me. :-(
>
> vfio is allocating an iommu domain and placing the iommu_device into
> that domain, the user therefore own the iova context for the parent
> device, how would that not manage all DMA? The vendor driver could
> theoretically also manipulate mappings within that domain, but that
> driver is a host kernel driver and therefore essentially trusted like
> any other host kernel driver. The only unique thing here is that it's
> part of a channel providing access for an untrusted user, so it needs
> to be particularly concerned with keeping that user access within
> bounds. Thanks,

Got it, thanks for the explanation. Looks like I overplayed the concern.

>
> Alex

Regards,
Yi Liu