Re: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

From: Baolu Lu
Date: Wed May 22 2024 - 05:57:44 EST


On 2024/5/22 16:58, Tian, Kevin wrote:
From: Jason Gunthorpe<jgg@xxxxxxxxxx>
Sent: Tuesday, May 14, 2024 11:56 PM

On Sun, May 12, 2024 at 08:34:02PM -0700, Nicolin Chen wrote:
On Sun, May 12, 2024 at 11:03:53AM -0300, Jason Gunthorpe wrote:
On Fri, Apr 12, 2024 at 08:47:01PM -0700, Nicolin Chen wrote:
Add a new iommufd_viommu core structure to represent a vIOMMU instance in
the user space, typically backed by a HW-accelerated feature of an IOMMU,
e.g. NVIDIA CMDQ-Virtualization (an ARM SMMUv3 extension) and AMD Hardware
Accelerated Virtualized IOMMU (vIOMMU).
I expect this will also be the only way to pass in an associated KVM,
userspace would supply the kvm when creating the viommu.

The tricky bit of this flow is how to manage the S2. It is necessary
that the S2 be linked to the viommu:

1) ARM BTM requires the VMID to be shared with KVM
2) AMD and others need the S2 translation because some of the HW
acceleration is done inside the guest address space

I haven't looked closely at AMD but presumably the VIOMMU create will
have to install the S2 into a DID or something?

So we need the S2 to exist before the VIOMMU is created, but the
drivers are going to need some more fixing before that will fully
work.
Can you elaborate on this point? VIOMMU is a dummy container when
it's created, and the association to S2 becomes relevant only when a
VQUEUE is created inside it and linked to a device? Then there should
be a window in between allowing the userspace to configure S2.

I'm not arguing against setting S2 up before vIOMMU creation. I just
want to better understand the rationale here.

Does the nesting domain create need the viommu as well (in place of
the S2 hwpt)? That feels sort of natural.
Yes, I had a similar thought initially: each viommu is backed by
a nested IOMMU HW, and a special HW accelerator like VCMDQ could
be treated as an extension on top of that. It might not be very
straightforward like the current design having vintf<->viommu and
vcmdq <-> vqueue though...
vqueue should be considered a sub object of the viommu and hold a
refcount on the viommu object for its lifetime.
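
The lifetime rule above can be sketched in plain user-space C. This is
only an illustration under stated assumptions: every name here
(viommu_sketch, vqueue_sketch, viommu_put and so on) is hypothetical and
is not the iommufd code, which would use its own object refcounting.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins, not the kernel's iommufd structures. */
struct viommu_sketch {
	int refs;	/* one for the creator plus one per live vqueue */
};

struct vqueue_sketch {
	struct viommu_sketch *viommu;	/* parent, pinned while the vqueue lives */
};

struct viommu_sketch *viommu_create(void)
{
	struct viommu_sketch *v = calloc(1, sizeof(*v));

	if (v)
		v->refs = 1;
	return v;
}

/* Drop one reference; returns 1 if this call freed the viommu. */
int viommu_put(struct viommu_sketch *v)
{
	assert(v->refs > 0);
	if (--v->refs > 0)
		return 0;
	free(v);
	return 1;
}

/* Creating a vqueue takes a reference on its parent viommu. */
struct vqueue_sketch *vqueue_create(struct viommu_sketch *v)
{
	struct vqueue_sketch *q = calloc(1, sizeof(*q));

	if (!q)
		return NULL;
	v->refs++;
	q->viommu = v;
	return q;
}

/* Destroying a vqueue releases that reference; returns 1 if the
 * parent viommu was freed as a result. */
int vqueue_destroy(struct vqueue_sketch *q)
{
	struct viommu_sketch *v = q->viommu;

	free(q);
	return viommu_put(v);
}
```

With this shape, the creator dropping its own reference while a vqueue
still exists leaves the viommu allocated; it is freed only when the last
vqueue is destroyed.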

In that case, we can then support viommu_cache_invalidate, which
is quite natural for SMMUv3. Yet, I recall Kevin said that VT-d
doesn't want or need that.
Right, Intel currently doesn't need it, but I feel like everyone will
need this eventually as the fast invalidation path is quite important.

Yes, there is no need, but I don't see any harm in preparing for such
an extension on VT-d. Logically it's clearer, e.g. if we decide to move
device TLB invalidation to a separate uAPI then vIOMMU is certainly
a clearer object to carry it. And hardware extensions really look like
optimizations of software implementations.

And we do need to make a decision now: if we make vIOMMU a generic
object for all vendors, it may have a potential impact on the user
page fault support which Baolu is working on. The so-called fault
object will be contained in vIOMMU, which is software managed on
VT-d/SMMU but passed through on AMD. And we probably don't need
another handle mechanism in the attach path, supposing the vIOMMU
object already contains the necessary information to find the
iommufd_object for a reported fault.

Yes, if the vIOMMU object tracks all iommufd devices that it manages.
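
To illustrate the condition, here is a rough user-space sketch of a
vIOMMU that tracks its devices and resolves a reported fault back to an
object. All names are hypothetical, and the assumption that a fault
carries a per-device hardware ID usable as a lookup key is mine, not
something settled in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DEVS 8	/* arbitrary bound for the sketch */

/* Hypothetical stand-ins, not the kernel's iommufd structures. */
struct dev_sketch {
	uint32_t hw_id;		/* ID assumed to be reported with a fault */
	uint32_t obj_id;	/* stand-in for the iommufd object ID */
};

struct viommu_devs_sketch {
	struct dev_sketch devs[SKETCH_MAX_DEVS];
	size_t ndevs;
};

/* Track a device; returns 0 on success, -1 if the table is full or
 * the hw_id is already tracked. */
int viommu_track_dev(struct viommu_devs_sketch *v, uint32_t hw_id,
		     uint32_t obj_id)
{
	assert(v != NULL);
	if (v->ndevs == SKETCH_MAX_DEVS)
		return -1;
	for (size_t i = 0; i < v->ndevs; i++)
		if (v->devs[i].hw_id == hw_id)
			return -1;
	v->devs[v->ndevs].hw_id = hw_id;
	v->devs[v->ndevs].obj_id = obj_id;
	v->ndevs++;
	return 0;
}

/* Resolve the hw_id from a fault report to the tracked device, or
 * NULL if this vIOMMU does not manage it. */
const struct dev_sketch *
viommu_find_dev(const struct viommu_devs_sketch *v, uint32_t hw_id)
{
	assert(v != NULL);
	for (size_t i = 0; i < v->ndevs; i++)
		if (v->devs[i].hw_id == hw_id)
			return &v->devs[i];
	return NULL;
}
```

A real implementation would likely use an xarray or similar rather than
a fixed array; the point is only that the lookup needs no extra handle
from the attach path once the vIOMMU owns the table.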

Best regards,
baolu