RE: [PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops
From: Tian, Kevin
Date: Wed May 22 2024 - 21:43:57 EST
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Wednesday, May 22, 2024 9:39 PM
>
> On Wed, May 22, 2024 at 08:58:34AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Tuesday, May 14, 2024 11:56 PM
> > >
> > > > > So we need the S2 to exist before the VIOMMU is created, but the
> > > > > drivers are going to need some more fixing before that will fully
> > > > > work.
> >
> > Can you elaborate on this point? VIOMMU is a dummy container when
> > it's created, and the association to the S2 becomes relevant only when
> > a VQUEUE is created inside it and linked to a device?
>
> VIOMMU contains:
> - A nesting parent
> - A KVM
> - Any global per-VM data the driver needs
> * In ARM case this is VMID, sometimes shared with KVM
In which case is it not shared with KVM? I had the impression that the
VMID always comes from KVM in the vCMDQ usage. 😊
> * In the AMD case this will allocate memory in the
>   "viommu backing storage memory"
>
> Objects can be created on top of a VIOMMU:
> - A nested HWPT (iommu_hwpt_alloc::pt_id can be a viommu)
> - A vqueue (ARM/AMD)
> - Other AMD virtualized objects (EventLog, PPRLog)
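So at creation time the viommu would conceptually carry something like
the below? Just a sketch to confirm my understanding; the field names
are made up and not referring to this series:

	/* purely illustrative, not the layout in this patch */
	struct iommufd_viommu {
		struct iommufd_object obj;
		/* fixed at allocation time and never changed afterwards */
		struct iommufd_hw_pagetable *s2_hwpt;
		struct kvm *kvm;	/* optional, e.g. for the shared VMID */
		/*
		 * driver per-VM data follows, e.g. the ARM VMID or the AMD
		 * "viommu backing storage memory"; nested HWPTs, vqueues,
		 * EventLog/PPRLog etc. are then created on top of this
		 * object and hold a reference on it.
		 */
	};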
>
> It is desirable to keep the VIOMMU linked to only a single nesting
> parent that never changes. Given it seems to be a small ask to
> allocate the nesting parent before the VIOMMU, providing it at VIOMMU
> creation time looks like it will simplify the drivers because they can
> rely on it always existing and never changing.
yes, this makes sense.
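and if the nesting parent is supplied at creation time then the
allocation uAPI presumably carries both the device and the S2, e.g.
roughly like below (hypothetical field names, just to confirm the
direction):

	/* illustrative only, not the uAPI in this series */
	struct iommu_viommu_alloc {
		__u32 size;
		__u32 flags;
		__u32 type;		/* vendor-specific viommu type */
		__u32 dev_id;		/* device selecting the iommu driver */
		__u32 hwpt_id;		/* pre-allocated nesting parent S2 */
		__u32 out_viommu_id;
	};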
>
> I think this lends itself to a logical layered VMM design..
>
> - If VFIO is being used get an iommufd
> - Allocate an IOAS for the entire guest physical
> - Determine the vIOMMU driver to use
> - Allocate a HWPT for use by all the vIOMMU instances
> - Allocate a VIOMMU per vIOMMU instance
>
> On ARM the S2 is not divorced from the VIOMMU: ARM requires a single
> VMID, shared with KVM, and localized to a single VM for some of the
> bypass features (vBTM, vCMDQ). So to attach an S2 you actually have to
> attach the VIOMMU to pick up the correct VMID.
>
> I imagine something like this:
> hwpt_alloc(deva, nesting_parent=true) = shared_s2
> viommu_alloc(deva, shared_s2) = viommu1
> viommu_alloc(devb, shared_s2) = viommu2
> hwpt_alloc(deva, viommu1, vste) = deva_vste
> hwpt_alloc(devb, viommu2, vste) = devb_vste
> attach(deva, deva_vste)
> attach(devb, devb_vste)
> attach(devc, shared_s2)
I wonder whether we want to make viommu the 1st-class citizen
for any nested hwpt, if it is desirable to enable it even for VT-d
which lacks a hw viommu concept at the moment.
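e.g. extending your example above to VT-d, with the viommu being a
pure sw object for now (devd standing for a device behind VT-d, names
made up):

   hwpt_alloc(devd, nesting_parent=true) = vtd_s2
   viommu_alloc(devd, vtd_s2) = viommu3
   hwpt_alloc(devd, viommu3, vs1) = devd_vs1
   attach(devd, devd_vs1)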
>
> The driver will then know it should program three different VMIDs for
> the same S2 page table, which matches the ARM expectation for
> VMID. That is to say we'd pass in the viommu as the pt_id for the
> iommu_hwpt_alloc. The viommu would imply both the S2 page table and
> any meta information like VMID the driver needs.
Can you elaborate on the "three different VMIDs" aspect? They are
all for the same VM, hence sharing the same VMID per the earlier
description. This is also echoed in patch 14:

  tegra241_cmdqv_viommu_alloc()
    vintf->vmid = smmu_domain->vmid;
>
> Both AMD and the vCMDQ thing need to translate some PFNs through the
> S2 and program them elsewhere; this is done manually by SW, and there
> are three choices I guess:
> - Have the VMM do it and provide void __user * to the driver
this sounds redundant with what the S2 already provides
> - Have the driver do it through the S2 directly and track
> S2 invalidations
this makes more sense to me, given the driver already needs to track
S2 invalidations to flush any nested caches related to the affected S2 range.
> - Have the driver open an access on the IOAS and use the access unmap
>
it requires adding more iommufd awareness into the iommu driver. I'm
inclined to do that only when strictly necessary.
> Not sure which is the best..
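For the 2nd option the driver side would roughly look like the below
(hand-waving the locking and the contiguity check; the helper name is
made up):

	/* hypothetical helper, just to illustrate the 2nd option */
	static phys_addr_t viommu_gpa_to_pa(struct iommu_domain *s2_domain,
					    dma_addr_t gpa)
	{
		/*
		 * Walk the S2 that the driver already owns. The result must
		 * be re-validated whenever an S2 invalidation covers this
		 * GPA, similar to flushing nested caches on S2 changes.
		 */
		return iommu_iova_to_phys(s2_domain, gpa);
	}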
>
> > > Right, Intel currently doesn't need it, but I feel like everyone will
> > > need this eventually as the fast invalidation path is quite important.
> >
> > yes, there is no need but I don't see any harm in preparing for such
> > an extension on VT-d. Logically it's clearer, e.g. if we decide to move
> > device TLB invalidation to a separate uAPI then vIOMMU is certainly
> > a clearer object to carry it. And hardware extensions really look like
> > optimizations of software implementations.
> >
> > and we do need to make a decision now, given that if we make vIOMMU
> > a generic object for all vendors it may have a potential impact on
> > the user page fault support which Baolu is working on.
>
> > the so-called
> > fault object will be contained in vIOMMU, which is software managed
> > on VT-d/SMMU but passed through on AMD.
>
> Hmm, given we currently have no known hardware entanglement between
> PRI and VIOMMU it does seem OK for PRI to just exist separately for
Isn't the AMD vPPRLog for directly delivering PRI requests into the guest?
> now. If someone needs them linked someday we can add a viommu_id to
> the create pri queue command.
I'm more worried about the potential conflict between the vqueue
object here and the fault queue object in Baolu's series, if we want
to introduce the vIOMMU concept to platforms which lack the hw
support. In that case both objects are software constructs to deliver
faults.