On Thu, Jun 22, 2023 at 06:15:17PM -0700, Suthikulpanit, Suravee wrote:
Jason,
On 6/22/2023 6:46 AM, Jason Gunthorpe wrote:
On Wed, Jun 21, 2023 at 06:54:47PM -0500, Suravee Suthikulpanit wrote:
Since the IOMMU hardware virtualizes the guest command buffer, IOMMU
operations such as invalidation of guest pages (i.e. stage 1) can be
accelerated: commands issued by the guest kernel are handled without
intervention from the hypervisor.
This is similar to what we are doing on ARM as well.
Ok
This series is implemented on top of the IOMMUFD framework. It leverages
the existing APIs and ioctls for providing guest IOMMU information
(i.e. struct iommu_hw_info_amd), and for allowing the guest to provide guest
page table information (i.e. struct iommu_hwpt_amd_v2) to set up a user
domain.
Please see [4], [5], and [6] for more detail on the AMD HW-vIOMMU.
NOTES
-----
This series is organized into two parts:
* Part1: Preparing IOMMU driver for HW-vIOMMU support (Patches 1-8).
* Part2: Introducing HW-vIOMMU support (Patches 9-21).
* Patches 12 and 21 extend the existing IOMMUFD ioctls to support
additional operations, which can be categorized into:
- Ioctls to init/destroy AMD HW-vIOMMU instance
- Ioctls to attach/detach guest devices to the AMD HW-vIOMMU instance.
- Ioctls to attach/detach guest domains to the AMD HW-vIOMMU instance.
- Ioctls to trap certain AMD HW-vIOMMU MMIO register accesses.
To describe the need for this ioctl, AMD IOMMU has two sets of MMIO registers:
- Ioctls to trap AMD HW-vIOMMU command buffer initialization.
No one else seems to need this kind of stuff, why is AMD different?
Emulation and mediation to create the vIOMMU is supposed to be in the
VMM side, not in the kernel. I don't want to see different models by
vendor.
These ioctls are not necessary for emulation, which I agree should be done
on the VMM side (e.g. QEMU). These ioctls provide the necessary
information for programming the AMD IOMMU hardware to provide a
hardware-assisted virtualized IOMMU.
You have one called 'trap', it shouldn't be like this. It seems like
this is trying to parse the command buffer in the kernel, it should be
done in the VMM.
In this series, the AMD IOMMU GCR3 table is actually set up when
IOMMUFD_CMD_HWPT_ALLOC is called, for which the driver provides a hook via
struct iommu_ops.domain_alloc_user().
That isn't entirely right either, the GCR3 should be programmed into
HW during iommu_domain attach.
The AMD-specific information is communicated from QEMU via
iommu_domain_user_data.iommu_hwpt_amd_v2. This is similar to INTEL
and ARM.
This is only for requesting the iommu_domain and supplying the gcr3 VA
for later use.
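The split being argued for here (allocation only records the GCR3 address, attach is what programs the hardware) can be sketched roughly as below. This is a toy model under stated assumptions: `user_domain`, `dev_entry`, and the three functions are hypothetical stand-ins, not the kernel's struct iommu_domain, the AMD device table entry format, or any code from this series, and it ignores the pinning and address translation a real driver would need.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a user (nested) domain: it only remembers
 * the GCR3 address supplied at allocation time. */
struct user_domain {
    uint64_t gcr3_va;
};

/* Hypothetical stand-in for per-device hardware state. */
struct dev_entry {
    uint64_t gcr3;
    int valid;
};

/* Allocation: record the GCR3 address for later use; touch no hardware. */
static int user_domain_alloc(struct user_domain *dom, uint64_t gcr3_va)
{
    if (!dom || gcr3_va == 0)
        return -1;
    dom->gcr3_va = gcr3_va;
    return 0;
}

/* Attach: this is the point where the GCR3 value reaches the device entry. */
static int user_domain_attach(const struct user_domain *dom,
                              struct dev_entry *dte)
{
    if (!dom || !dte || dom->gcr3_va == 0)
        return -1;
    dte->gcr3 = dom->gcr3_va;
    dte->valid = 1;
    return 0;
}

/* Detach: clear the device entry again. */
static void user_domain_detach(struct dev_entry *dte)
{
    if (!dte)
        return;
    dte->gcr3 = 0;
    dte->valid = 0;
}
```

With this shape, allocating a domain leaves the device entry untouched, and only a successful attach makes it valid.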
....There is still work to be done here to fully support PASID. I'll
take a look at this next.
I would expect PASID work is only about invalidation?
To start, focus only on user space page tables and kernel mediated
invalidation and fit into the same model as everyone else. This is
approx the same patches and uAPI you see for ARM and Intel. AFAICT
AMD's HW is very similar to ARM's, so you should be aligning to the
ARM design.
I think the user space page table is covered as described above.
I'm not sure, it doesn't look like it is what I would expect.
It seems that user space is supposed to call the ioctl
IOMMUFD_CMD_HWPT_INVALIDATE for both INTEL and ARM to issue invalidations for
the stage 1 page table. Please lemme know if I misunderstand the purpose of
this ioctl.
Yes, the VMM traps the invalidation and issues it like this.
However, for AMD, the HW-vIOMMU virtualizes the guest command buffer. When
it sees a page table invalidation command in the guest command buffer, it
takes care of the invalidation using information in the DomIDMap, which maps
the guest domain ID (gDomID) of a particular guest to the corresponding host
domain ID (hDomID) of the device, and invalidates the nested translation
according to the specified PASID, DomID, and GVA.
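As a rough illustration of the gDomID to hDomID translation described above, the lookup can be modelled in software as a small per-guest table. This is only a sketch under assumptions: `struct domid_map`, `DOMID_MAP_ENTRIES`, and both helpers are hypothetical names, and the layout says nothing about the actual DomIDMap format used by the hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define DOMID_MAP_ENTRIES 16 /* assumed size, for illustration only */

/* Hypothetical per-guest table indexed by guest domain ID. */
struct domid_map {
    uint16_t hdom_id[DOMID_MAP_ENTRIES];
    uint8_t valid[DOMID_MAP_ENTRIES];
};

/* Record that guest domain gdom_id is backed by host domain hdom_id. */
static int domid_map_set(struct domid_map *map, uint16_t gdom_id,
                         uint16_t hdom_id)
{
    if (!map || gdom_id >= DOMID_MAP_ENTRIES)
        return -1;
    map->hdom_id[gdom_id] = hdom_id;
    map->valid[gdom_id] = 1;
    return 0;
}

/* Translate a guest domain ID; returns -1 if no mapping was installed. */
static int domid_map_lookup(const struct domid_map *map, uint16_t gdom_id,
                            uint16_t *hdom_id)
{
    if (!map || !hdom_id || gdom_id >= DOMID_MAP_ENTRIES ||
        !map->valid[gdom_id])
        return -1;
    *hdom_id = map->hdom_id[gdom_id];
    return 0;
}
```

An invalidation command carrying a gDomID with no installed mapping fails the lookup instead of being applied to some unrelated host domain.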
The VMM should do all of this stuff. The VMM parses the command buffer
and the VMM converts the commands to invalidation ioctls.
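The VMM-side model described here can be sketched as a loop that walks the guest command ring and forwards only the invalidation commands. This is a minimal sketch, not QEMU code: the command layout (`struct guest_cmd`), the opcodes, and `struct inval_req` are hypothetical simplifications rather than the AMD IOMMU command format, and the `issue` callback stands in for whatever the VMM would really do, for example calling IOMMUFD_CMD_HWPT_INVALIDATE.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified guest command opcodes. */
enum guest_cmd_op {
    GUEST_CMD_COMPLETION_WAIT = 1,
    GUEST_CMD_INVALIDATE_PAGES = 2,
};

/* Hypothetical, simplified guest command ring entry. */
struct guest_cmd {
    uint32_t op;
    uint32_t gdom_id;
    uint32_t pasid;
    uint64_t gva;
};

/* Hypothetical invalidation request handed to the kernel by the VMM. */
struct inval_req {
    uint32_t gdom_id;
    uint32_t pasid;
    uint64_t gva;
};

/* Callback that issues one invalidation; returns 0 on success. */
typedef int (*inval_fn)(const struct inval_req *req, void *ctx);

/* Stand-in for the real ioctl call: just records what was requested. */
struct inval_log {
    int count;
    struct inval_req last;
};

static int record_inval(const struct inval_req *req, void *ctx)
{
    struct inval_log *log = ctx;

    log->count++;
    log->last = *req;
    return 0;
}

/* Walk the ring from head up to (not including) tail and forward every
 * invalidation command. Returns the number forwarded, or -1 on error. */
static int vmm_process_cmd_ring(const struct guest_cmd *ring, size_t ring_len,
                                size_t head, size_t tail,
                                inval_fn issue, void *ctx)
{
    int forwarded = 0;

    if (!ring || !issue || ring_len == 0 ||
        head >= ring_len || tail >= ring_len)
        return -1;

    while (head != tail) {
        const struct guest_cmd *cmd = &ring[head];

        if (cmd->op == GUEST_CMD_INVALIDATE_PAGES) {
            struct inval_req req = {
                .gdom_id = cmd->gdom_id,
                .pasid = cmd->pasid,
                .gva = cmd->gva,
            };

            if (issue(&req, ctx) != 0)
                return -1;
            forwarded++;
        }
        head = (head + 1) % ring_len;
    }
    return forwarded;
}
```

Passing the issuing step as a callback keeps the parsing logic in user space and leaves the kernel interface as a single, generic invalidation entry point.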
I'm unclear if AMD supports a mode where the HW can directly operate
a command/invalidation queue in the VM without virtualization, e.g. DMA
from guest memory and deliver directly to the guest completion
interrupts.
If it always needs SW then the SW part should be in the VMM, not the
kernel. Then you don't need to load all these tables into the kernel.