RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
From: Tian, Kevin
Date: Tue May 11 2021 - 18:54:06 EST
> From: Liu Yi L <yi.l.liu@xxxxxxxxxxxxxxx>
> Sent: Tuesday, May 11, 2021 9:25 PM
>
> On Tue, 11 May 2021 09:10:03 +0000, Tian, Kevin wrote:
>
> > > From: Jason Gunthorpe
> > > Sent: Monday, May 10, 2021 8:37 PM
> > >
> > [...]
> > > > gPASID!=hPASID has a problem when assigning a physical device which
> > > > supports both shared work queue (ENQCMD with PASID in MSR)
> > > > and dedicated work queue (PASID in device register) to a guest
> > > > process which is associated with a gPASID. Say the host kernel has set up
> > > > the hPASID entry with nested translation through /dev/ioasid. For
> > > > a shared work queue the CPU is configured to translate the gPASID in the
> > > > MSR into an **hPASID** before the payload goes out on the wire. However,
> > > > for a dedicated work queue the device MMIO register is directly mapped
> > > > to and programmed by the guest, thus containing a **gPASID** value,
> > > > implying that DMA requests through this interface will hit IOMMU faults
> > > > due to an invalid gPASID entry. Having gPASID==hPASID is a simple
> > > > workaround here. mdev doesn't have this problem because the
> > > > PASID register is in the emulated control path and thus can be translated
> > > > to the hPASID manually by the mdev driver.
> > >
> > > This all must be explicit too.
> > >
> > > If a PASID is allocated and it is going to be used with ENQCMD then
> > > everything needs to know it is actually quite different than a PASID
> > > that was allocated to be used with a normal SRIOV device, for
> > > instance.
> > >
> > > The former case can accept that the guest PASID is virtualized, while
> > > the latter cannot.
> > >
> > > This is also why PASID per RID has to be an option. When I assign a
> > > full SRIOV function to the guest then that entire RID space needs to
> > > also be assigned to the guest. Upon migration I need to take all the
> > > physical PASIDs and rebuild them in another hypervisor exactly as is.
> > >
> > > If you force all RIDs into a global PASID pool then normal SRIOV
> > > migration w/PASID becomes impossible, i.e. ENQCMD breaks everything
> > > else that should work.
> > >
> > > This is why you need to sort all this out and why it feels like some
> > > of the specs here have been mis-designed.
> > >
> > > I'm not sure carving out ranges is really workable for migration.
> > >
> > > I think the real answer is to carve out entire RIDs as being in the
> > > global pool or not. Then the ENQCMD HW can be bundled together and
> > > everything else can live in the natural PASID per RID world.
> > >
> >
> > OK. Here is the revised scheme, making everything explicit.
> >
> > There are three scenarios to be considered:
> >
> > 1) SR-IOV (AMD/ARM):
> > - "PASID per RID" with guest-allocated PASIDs;
> > - PASID table managed by guest (in GPA space);
> > - the entire PASID space delegated to guest;
> > - no need to explicitly register guest-allocated PASIDs to host;
> > - uAPI for attaching PASID table:
> >
> > // set to "PASID per RID"
> > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
> >
> > // When Qemu captures a new PASID table through vIOMMU;
> > pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid);
> >
> > // Set the PASID table to the RID associated with pasidtbl_ioasid;
> > ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid, gpa_addr);
> >
> > 2) SR-IOV, no ENQCMD (Intel):
> > - "PASID per RID" with guest-allocated PASIDs;
> > - PASID table managed by host (in HPA space);
> > - the entire PASID space delegated to guest too;
> > - host must be explicitly notified for guest-allocated PASIDs;
> > - uAPI for binding user-allocated PASIDs:
> >
> > // set to "PASID per RID"
> > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
> >
> > // When Qemu captures a new PASID allocated through vIOMMU;
>
> Is this achieved by VCMD or by capturing guest's PASID cache invalidation?
The latter one.
>
> > pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);
> >
> > // Tell the kernel to associate pasid to pgtbl_ioasid in internal structure;
> > // &pasid being a pointer due to a requirement in scenario-3
> > ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &pasid);
> >
> > // Set guest page table to the RID+pasid associated to pgtbl_ioasid
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
> >
> > 3) SRIOV, ENQCMD (Intel):
> > - "PASID global" with host-allocated PASIDs;
> > - PASID table managed by host (in HPA space);
> > - all RIDs bound to this ioasid_fd use the global pool;
> > - however, exposing global PASID into guest breaks migration;
> > - hybrid scheme: split local PASID range and global PASID range;
> > - force guest to use only local PASID range (through vIOMMU);
> > - for ENQCMD, configure CPU to translate local->global;
> > - for non-ENQCMD, setup both local/global pasid entries;
> > - uAPI for range split and CPU pasid mapping:
> >
> > // set to "PASID global"
> > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
> >
> > // split local/global range, applying to all RIDs in this fd
> > // Example: local [0, 1024), global [1024, max)
> > // local PASID range is managed by guest and migrated as VM state
> > // global PASIDs are re-allocated and mapped to local PASIDs post migration
> > ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);
> >
> > // When Qemu captures a new local_pasid allocated through vIOMMU;
> > pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);
> >
> > // Tell the kernel to associate local_pasid to pgtbl_ioasid in internal structure;
> > // Due to HWID_GLOBAL, the kernel also allocates a global_pasid from the
> > // global pool. From now on, every hwid-related operation must be applied
> > // to both PASIDs for this page table;
> > // global_pasid is returned to userspace in the same field as local_pasid;
> > ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &local_pasid);
> >
> > // Qemu then registers the local_pasid/global_pasid pair to KVM for setting up
> > // the CPU PASID translation table;
> > ioctl(kvm_fd, KVM_SET_PASID_MAPPING, local_pasid, global_pasid);
> >
> > // Set guest page table to the RID+{local_pasid, global_pasid} associated
> > // to pgtbl_ioasid;
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
> >
> > -----
> > Notes:
> >
> > I tried to keep common commands in a generic format for all scenarios, while
> > introducing new commands for usage-specific requirements. Everything is
> > made explicit now.
> >
> > The userspace has sufficient information to choose its desired scheme based
> > on vIOMMU type and platform information (e.g. whether ENQCMD is exposed
> > in virtual CPUID, whether assigned devices support DMWr, etc.).
> >
> > The above example assumes one RID per bound page table, because the vIOMMU
> > identifies new guest page tables per-RID. If there are other usages requiring
> > multiple RIDs per page table, SET_HWID/BIND_PGTABLE could accept
> > another device_handle parameter to specify which RID is targeted for the
> > operation.
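For illustration, such a multi-RID variant might look like the following (device_handle is the hypothetical extra parameter discussed above, not part of the proposed uAPI):

    // Hypothetical extension: target a specific RID for this operation
    ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, device_handle, &pasid);
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, device_handle, gpa_addr);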
> >
> > When considering SIOV/mdev there is no change to the above uAPI sequence.
> > It's n/a for 1) since SIOV requires the PASID table in HPA space, and it
> > causes no change to 3) regarding the split-range scheme. The only
> > conceptual change is in 2): although it's still "PASID per RID", the
> > PASIDs must be managed by the host because the parent driver also allocates
> > PASIDs from the per-RID space to mark mdevs (RID+PASID). But this difference
> > doesn't change the uAPI flow - just treat the user-provisioned PASID as
> > 'virtual' and then allocate a 'real' PASID at IOASID_SET_HWID. Later, always
> > use the real one when programming the PASID entry (IOASID_BIND_PGTABLE) or
> > the device PASID register (converted in the mediation path).
> >
> > If all of the above can work reasonably, we don't even need the special VCMD
> > interface in VT-d for the guest to allocate PASIDs from the host. Just always
> > let the guest manage its PASIDs (with the restriction of available local
> > PASIDs), whether through a global allocator or a per-RID allocator. The
> > vIOMMU side just sticks to the per-RID emulation according to the spec.
>
> yeah, if this scheme for scenario 3) is good, we may limit the range of
> local PASIDs by limiting the PASID bit width of the vIOMMU. QEMU can get
> the local PASID allocated by the guest IOMMU when the guest does a PASID
> cache invalidation.
>
> --
> Regards,
> Yi Liu