Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
From: david@xxxxxxxxxxxxxxxxxxxxx
Date: Fri Oct 01 2021 - 02:31:05 EST
On Thu, Sep 23, 2021 at 09:14:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > Sent: Wednesday, September 22, 2021 10:09 PM
> >
> > On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > > Sent: Wednesday, September 22, 2021 1:45 AM
> > > >
> > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > > allocating an IOASID, userspace is expected to specify the type and
> > > > > format information for the target I/O page table.
> > > > >
> > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > > semantics. For this type the user should specify the addr_width of
> > > > > the I/O address space and whether the I/O page table is created in
> > > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > > as the false setting requires additional contract with KVM on handling
> > > > > WBINVD emulation, which can be added later.
> > > > >
> > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > > for what formats can be specified when allocating an IOASID.
> > > > >
> > > > > Open:
> > > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > > > Per previous discussion they can also use vfio type1v2 as long as there
> > > > > is a way to claim a specific iova range from a system-wide address space.
> > > > > This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > > > adopted this design yet. We hope to have formal alignment in v1 discussion
> > > > > and then decide how to incorporate it in v2.
> > > >
> > > > I think the request was to include a start/end IO address hint when
> > > > creating the ios. When the kernel creates it then it can return the
> > >
> > > is the hint single-range or could be multiple-ranges?
> >
> > David explained it here:
> >
> > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
> >
> > qemu needs to be able to choose if it gets the 32 bit range or 64
> > bit range.
> >
> > So a 'range hint' will do the job
> >
> > David also suggested this:
> >
> > https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/
> >
> > So I like this better:
> >
> > struct iommu_ioasid_alloc {
> >         __u32 argsz;
> >
> >         __u32 flags;
> > #define IOMMU_IOASID_ENFORCE_SNOOP (1 << 0)
> > #define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1)
> >
> >         __aligned_u64 max_iova_hint;
> >         __aligned_u64 base_iova_hint; // Used only if IOMMU_IOASID_HINT_BASE_IOVA
> >
> >         // For creating nested page tables
> >         __u32 parent_ios_id;
> >         __u32 format;
> > #define IOMMU_FORMAT_KERNEL 0
> > #define IOMMU_FORMAT_PPC_XXX 2
> > #define IOMMU_FORMAT_[..]
> >         __u32 format_flags; // Layout depends on format above
> >
> >         __aligned_u64 user_page_directory; // Used if parent_ios_id != 0
> > };
> >
> > Again 'type' as an overall API indicator should not exist, feature
> > flags need to have clear narrow meanings.
>
> currently the type is aimed to differentiate three usages:
>
> - kernel-managed I/O page table
> - user-managed I/O page table
> - shared I/O page table (e.g. with mm, or ept)
>
> we can remove 'type', but is FORMAT_KERNEL/USER/SHARED a good
> indicator? their difference is not about format.

To me "format" indicates how the IO translation information is
encoded. We potentially have two different encodings: from userspace
to the kernel and from the kernel to the hardware. But since this is
the userspace API, it's only the userspace-to-kernel one that matters
here.

In that sense, KERNEL is a "format": we encode the translation
information as a series of IOMAP operations to the kernel, rather than
as an in-memory structure.

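
As a rough sketch of what "a series of IOMAP operations" means in
practice, the toy model below builds an address space purely from
map calls, with no page table layout visible to the caller. All names
(sketch_ioas, sketch_iomap, sketch_translate) are hypothetical and
are not part of the RFC; a real kernel would of course track mappings
very differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_MAPS 8

/* Hypothetical: one mapping, as described by a single map operation. */
struct sketch_map {
	uint64_t iova;   /* start of the IO virtual range */
	uint64_t target; /* where it translates to */
	uint64_t len;    /* length in bytes */
};

/* Hypothetical: an address space is just the mappings applied so far. */
struct sketch_ioas {
	struct sketch_map maps[SKETCH_MAX_MAPS];
	size_t nr;
};

/* Apply one map operation. Returns 0 on success, -1 on error. */
static int sketch_iomap(struct sketch_ioas *as, uint64_t iova,
			uint64_t target, uint64_t len)
{
	assert(as != NULL);

	/* Reject empty ranges, ranges that wrap, and a full table. */
	if (as->nr == SKETCH_MAX_MAPS || len == 0 || len > UINT64_MAX - iova)
		return -1;

	/* Reject overlap with an existing mapping. */
	for (size_t i = 0; i < as->nr; i++) {
		const struct sketch_map *m = &as->maps[i];

		if (iova < m->iova + m->len && m->iova < iova + len)
			return -1;
	}

	as->maps[as->nr++] = (struct sketch_map){ iova, target, len };
	return 0;
}

/* Translate one IOVA. Returns 0 and fills *out, or -1 if unmapped. */
static int sketch_translate(const struct sketch_ioas *as, uint64_t iova,
			    uint64_t *out)
{
	assert(as != NULL && out != NULL);

	for (size_t i = 0; i < as->nr; i++) {
		const struct sketch_map *m = &as->maps[i];

		if (iova >= m->iova && iova - m->iova < m->len) {
			*out = m->target + (iova - m->iova);
			return 0;
		}
	}
	return -1;
}
```

The caller only ever says "map this range there"; how that ends up
encoded for the hardware is the kernel's business, which is the
distinction from handing over an in-memory structure.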
> > This does both of David's suggestions at once. If qemu wants the 1G
> > limited region it could specify max_iova_hint = 1G, if it wants the
> > extended 64bit region with the hole it can give either the high base or
> > a large max_iova_hint. format/format_flags allows a further
>
> Dave's links didn't answer one puzzle from me. Does PPC need accurate
> range information, or is it ok with a large range including holes (then
> let the kernel figure out where the holes are located)?

I need more specifics to answer that. Are you asking from a
userspace PoV, a guest kernel's or the host kernel's? In general I
think requiring userspace to locate and work around holes is a bad
idea. If userspace requests a range, it should get *all* of that
range.
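
To make the "all of that range" rule concrete, here is a minimal
sketch of an all-or-nothing check: a request is granted only if no
unusable hole intersects it, otherwise it fails outright rather than
succeeding with gaps. The names (sketch_range, sketch_claim_range) and
the idea of the kernel holding a flat list of holes are assumptions
for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: an inclusive IOVA range [first, last]. */
struct sketch_range {
	uint64_t first;
	uint64_t last;
};

/*
 * All-or-nothing claim: returns 0 if every address in 'req' is
 * usable, -1 if the request is malformed or touches any hole.
 * Userspace never has to discover the holes itself.
 */
static int sketch_claim_range(struct sketch_range req,
			      const struct sketch_range *holes,
			      size_t nr_holes)
{
	assert(holes != NULL || nr_holes == 0);

	if (req.first > req.last)
		return -1;

	for (size_t i = 0; i < nr_holes; i++) {
		/* Two inclusive ranges intersect if each starts
		 * no later than the other ends. */
		if (req.first <= holes[i].last && holes[i].first <= req.last)
			return -1;
	}
	return 0;
}
```

With a single (arbitrarily chosen) hole at [0xfee00000, 0xfeefffff], a
request for [0, 0x3fffffff] succeeds, while a request for
[0, 0xffffffff] fails as a whole.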

The ppc case is further complicated because there are multiple ranges
and each range could have separate IO page tables. In practice
non-kernel-managed IO pagetables are likely to be hard on ppc (or at
least rely on firmware/hypervisor interfaces which don't exist yet,
AFAIK). But even then, the underlying hardware page table format can
affect the minimum pagesize of each range, which could be different.
How all of this interacts with PASIDs I really haven't figured out.

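
As an illustration of the per-range pagesize point only (it says
nothing about PASIDs), the sketch below models several windows, each
with its own minimum page size, and checks a mapping against the
window it falls into. The names (sketch_window, sketch_map_allowed)
and the window layout in the example are made up, not taken from any
ppc interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one IOVA window with its own minimum page size. */
struct sketch_window {
	uint64_t base;           /* first IOVA in the window */
	uint64_t size;           /* window length in bytes */
	unsigned int page_shift; /* log2 of the minimum page size */
};

/* Find the window containing 'iova', or NULL if there is none. */
static const struct sketch_window *
sketch_find_window(const struct sketch_window *w, size_t nr, uint64_t iova)
{
	for (size_t i = 0; i < nr; i++) {
		if (iova >= w[i].base && iova - w[i].base < w[i].size)
			return &w[i];
	}
	return NULL;
}

/*
 * A mapping is allowed only if it lies entirely inside one window
 * and both its start and length are aligned to that window's
 * minimum page size.
 */
static bool sketch_map_allowed(const struct sketch_window *w, size_t nr,
			       uint64_t iova, uint64_t len)
{
	const struct sketch_window *win;
	uint64_t mask;

	assert(w != NULL || nr == 0);

	win = sketch_find_window(w, nr, iova);
	if (!win || len == 0)
		return false;
	if (len > win->size - (iova - win->base))
		return false;

	mask = (UINT64_C(1) << win->page_shift) - 1;
	return ((iova | len) & mask) == 0;
}
```

With one window using 4 KiB pages and another using 64 KiB pages, the
same 4 KiB mapping request is valid in the first and rejected in the
second, so the alignment rules userspace must follow depend on which
range it is mapping into.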
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson