Re: [PATCH v1 01/14] iommu: Add iommu_get_unmanaged_domain helper

From: Jason Gunthorpe
Date: Thu Mar 09 2023 - 19:23:28 EST


On Thu, Mar 09, 2023 at 07:04:19PM +0000, Robin Murphy wrote:

> You are literally insisting on changing this core API, to work
> around intentionally breaking an existing behaviour which could easily be
> preserved (far less invasively), for the sake of preserving some other
> theoretical behaviour that IS NOT USEFUL.

No! I am insisting that the core API *make logical sense* and not be a
random mismash of things that just happen to work. I'm not optimizing
for LOC here.

The end goal is to remove access to the S2 and not have this API at
all. It makes no sense to have a temporary step that makes the S2 even
more available then go and undo that.

This is a 'code smell' annotation that it needs to use a special API
because this code has a direct special requirement based on ARM's
definition to work on the S2 of the nest.

The goal for iommufd is to make ITS page mapping to have already
happened at domain allocation time. It cannot be deferred until
irq_domain_ops.alloc() time. Obviously once we do that we don't have a
need to obtain the S2 from a nest domain.

> The overall design of the iommu core APIs *is* being changed, because the
> core API design also follows this logic for any domain type:
>
> domain = iommu_get_domain_for_dev();
> phys = iommu_iova_to_phys(domain, iova);
> //phys meaningfully represents whether iova was valid

If a nesting domain is used then this really should not translate the
IOVA through the S2.

IMHO the proper answer for a nesting domain is the same as for an SVA
domain - EOPNOTSUP.

But more importantly it is illegal to call this API unless the caller
already has some range lock on the IOVA being queried. It is
impossible to obtain a range lock on a NEST or SVA, so this is not
allowed to be called on those domains.

Currently it looks like it crashes if something calls it with an
SVA/NEST to drive this point home :)

> Forgive me for getting wound up, but I'm a pragmatist and my tolerance for
> ignoring reality is low.

Well, I would like this settled and it seems bike-shedding to me.

> > Attaching the S1 always causes the embedded S2 to be used too - they
> > are not separable so we don't have APIs talking about about
> > "attaching" the S2.
>
> Yes, that is one way of viewing things, but it's far from the only
> way.

Sure, but it is the design we are going with. It is the design that
was built into iommufd from day 1.

If there is a good reason to change it, I'm open to hear it, but we've
gone through a lot of use cases now and it is holding up well.

> typical lifecycle will involve starting the VM with S2 alone, then enabling
> nesting later - we can view that as allocating a nested domain based on S2,
> then "replacing" S2 with nested, but we could equally view it as just
> attaching the nested domain on top of the existing S2, like pushing to a
> stack (I do agree that trying to model it as two completely independent and
> orthogonal attachments would not be sensible). It's really just semantics of
> how we prefer to describe things, and whether the s2_domain pointer is
> passed up-front to domain_alloc or later to attach_dev.

We settled on domain_alloc time for a pretty basic reason - it keeps
the invariant that the IOVA to Phys of an iommu_domain is
universal. Meaning IOVA to Phys always gives the same answer
regardless of what device is attached.

We directly disallow mixing a S1 with different S2's. I also don't
like the idea of 'first to reach attach_dev assigns the S2', partially
initialized iommu_domains have not worked well so far IMHO.

I think this makes it easier for the IOMMU to assign cache tags as the
the S1 iommu_domain always gives the same translation so it can
trivially be assigned to a cache tag.

Ie a basic Intel implementation can assign a DID to the S1. We know
that no matter what device the NEST iommu domain is used the DID is
the correct cache tag because of the universal translation rule.

ARM will assign a VMID to the S2, the S1 is actually a CD table handle
and has its own invalidation.

AFAICT AMD will assign a DID per-device to the S1, because they don't
have ASIDs in their PASID table :\

> The opaque nested domain looks clean in some ways, but on the other hand it
> also heavily obfuscates how translations may be shared between one nested
> domain and other nested and non-nested domains, such that changing mappings
> in one may affect others.

Well, it keeps the logic in iommufd which should be the only user of
this stuff.

If we develop a non-iommufd user then we can revist where the
abstractions should live, but for now iommufd is handling the thin
common abstraction.

> This is a major and potentially surprising
> paradigm shift away from the current notion that all domains represent
> individual isolated address spaces, so abstracting it more openly within the
> core API internals would be clearer about what's really going on. And
> putting common concepts at common levels of abstraction makes things simpler
> and is why we have core API code at all.

I'm all for more commen concepts, but I'm pragmatic here - I'd like to
see duplicated code in the drivers become unduplicated by the
abstraction. I'm not sure what this is in the S2 area, so far I
haven't noticed anything in the ARM and Intel patch series.

Without a need for S2 pointers in any code outside iommufd and the
driver I'd prefer to keep APIs out of there to prevent abuse.

> ultimately be a good deal neater and more productive than adding yet more
> special-case ops that every driver is just going to implement identically.

To be clear for this patch only SMMUv3 will implement this op because
only SMMUv3 has this ITS problem. Once we fix it the op will be
deleted.

It really is a hack because we are all too scared to fix this properly
right now :)

> And to even imagine the future possibility of things like S2 pagetable
> sharing with KVM, or unpinned S2 with ATS/PRI or SMMU stalls, by tunnelling
> more private nesting interfaces directly between IOMMUFD and individual
> drivers without some degree of common IOMMU API abstraction in between...
> no. Just... no.

I'm all for abstractions that remove duplicated code from drivers, I'm
just not sure right now what they will actualy be...

Like I say, I don't view this as duplicated code, this is hack for ARM
to temporarily break the architectural model of a hidden S2, not true
lasting design.

Thanks,
Jason