Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
From: David Gibson
Date: Thu Apr 29 2021 - 00:18:21 EST
On Wed, Apr 28, 2021 at 09:21:49PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 28, 2021 at 11:23:39AM +1000, David Gibson wrote:
>
> > Yes. My proposed model for a unified interface would be that when you
> > create a new container/IOASID, *no* IOVAs are valid.
>
> Hurm, it is quite tricky. All IOMMUs seem to have a dead zone around
> the MSI window, so negotiating this all in a general way is not going
> to be a very simple API.
>
> To be general it would be nicer to say something like 'I need XXGB of
> IOVA space' 'I need 32 bit IOVA space' etc and have the kernel return
> ranges that sum up to at least that big. Then the kernel can do
> all its optimizations.
Ah, yes, sorry. We do need an API that lets the kernel make more of
the decisions too. For userspace drivers it would generally be
sufficient to just ask for XXX size of IOVA space wherever you can get
it. Handling guests requires more precision. So, maybe a request
interface with a bunch of hint variables and a matching set of
MAP_FIXED-like flags to assert which ones aren't negotiable.
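
To make that a bit more concrete, here is a rough sketch of the shape I
have in mind. Everything in it is made up for illustration: struct
iova_request, struct iova_aperture, the IOVA_REQ_FIXED_* flags and
iova_place() are hypothetical names, not an existing or proposed uAPI,
and alignment handling is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags: which hints are hard requirements (cf. MAP_FIXED). */
#define IOVA_REQ_FIXED_BASE   (1u << 0) /* window must start at base_hint */
#define IOVA_REQ_FIXED_PGSIZE (1u << 1) /* IO page size must equal pgsize_hint */

struct iova_request {            /* hypothetical, for illustration only */
	uint64_t size;           /* required: bytes of IOVA space wanted */
	uint64_t base_hint;      /* preferred start address */
	uint64_t pgsize_hint;    /* preferred IO page size */
	uint32_t fixed;          /* IOVA_REQ_FIXED_* bits */
};

struct iova_aperture {           /* one range the host IOMMU can translate */
	uint64_t start;
	uint64_t last;           /* inclusive end, avoids overflow at 2^64 */
	uint64_t pgsize;
};

/*
 * Try to place a request inside a single aperture.
 * Returns true and stores the chosen start in *base on success.
 * Alignment is ignored to keep the sketch short.
 */
static bool iova_place(const struct iova_request *req,
		       const struct iova_aperture *ap, uint64_t *base)
{
	assert(req->size > 0);

	if ((req->fixed & IOVA_REQ_FIXED_PGSIZE) &&
	    ap->pgsize != req->pgsize_hint)
		return false;

	/* Honour the base hint whenever it fits. */
	if (req->base_hint >= ap->start && req->base_hint <= ap->last &&
	    req->size - 1 <= ap->last - req->base_hint) {
		*base = req->base_hint;
		return true;
	}

	/* The hint did not fit: fatal only if the caller said it is fixed. */
	if (req->fixed & IOVA_REQ_FIXED_BASE)
		return false;

	/* Otherwise fall back to "wherever you can get it". */
	if (req->size - 1 > ap->last - ap->start)
		return false;
	*base = ap->start;
	return true;
}
```

A userspace driver would leave "fixed" at 0 and take whatever base comes
back; qemu would set IOVA_REQ_FIXED_BASE and treat a failure as "this
host can't run this guest".
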
> I guess you are going to say that the qemu PPC vIOMMU driver needs
> more exact control..
*Every* vIOMMU driver needs more exact control. The guest drivers
will expect to program the guest devices with IOVAs matching the guest
platform's IOMMU model. Therefore the backing host IOMMU has to be
programmed to respond to those IOVAs. If it can't be, there's no way
around it, and you want to fail out early. With this model that will
happen when qemu (say) requests the host IOMMU window(s) to match the
guest's expected IOVA ranges.
Actually, come to that even guests without a vIOMMU need more exact
control: they'll expect IOVA to match GPA, so if your host IOMMU can't
be set up to translate the full range of GPAs, again, you're out of luck.
The only reason x86 has been able to ignore this is that the
assumption has been that all IOMMUs can translate IOVAs from 0..<a big
enough number for any reasonable RAM size>. Once you really start to
look at what the limits are, you need the exact window control I'm
describing.
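
As a sketch of the "fail out early" check for the no-vIOMMU case, where
IOVA has to equal GPA: the code below assumes the host windows and the
guest RAM regions are both available as simple inclusive ranges. struct
range and first_untranslatable() are hypothetical names, and for
simplicity a guest region has to fit inside one window rather than span
two adjacent ones.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct range {                   /* hypothetical helper type */
	uint64_t start;
	uint64_t last;           /* inclusive */
};

/* True if r lies entirely inside one of the host IOMMU windows. */
static bool range_in_windows(const struct range *r,
			     const struct range *win, size_t nwin)
{
	for (size_t i = 0; i < nwin; i++) {
		if (r->start >= win[i].start && r->last <= win[i].last)
			return true;
	}
	return false;
}

/*
 * With IOVA == GPA, every guest RAM region must be translatable.
 * Returns the index of the first region the host cannot cover,
 * or -1 if they all fit.
 */
static int first_untranslatable(const struct range *guest, size_t nguest,
				const struct range *win, size_t nwin)
{
	for (size_t i = 0; i < nguest; i++) {
		if (!range_in_windows(&guest[i], win, nwin))
			return (int)i;
	}
	return -1;
}
```

Anything other than -1 means the guest can't be started on this host
IOMMU configuration, and it is better to find that out at setup time
than at the first DMA.
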
> > I expect we'd need some kind of query operation to expose limitations
> > on the number of windows, addresses for them, available pagesizes etc.
>
> Is page size an assumption that hugetlbfs will always be used for backing
> memory or something?
So for TCEs (and maybe other IOMMUs out there), the IO page tables are
independent of the CPU page tables. They don't have the same format,
and they don't necessarily have the same page size. In the case of a
bare metal kernel working in physical addresses they can use that TCE
page size however they like. For userspace you get another layer of
complexity. Essentially to implement things correctly the backing
IOMMU needs to have a page size granularity that's the minimum of
whatever granularity the userspace or guest driver expects and the
host page size backing the memory.
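
Roughly, the page size selection looks like the sketch below. It assumes
the supported IO page sizes come back from the query operation as a
bitmap where bit n set means 2^n bytes is supported, and that all page
sizes are powers of two. pick_io_pgsize() is a hypothetical helper, not
an existing function.

```c
#include <stdint.h>

/*
 * Pick the IO page size for a window: the largest supported size that
 * is no bigger than either the page size the guest/userspace driver
 * expects or the host page size backing the memory.
 * Returns 0 if the IOMMU supports nothing that small.
 */
static uint64_t pick_io_pgsize(uint64_t supported,
			       uint64_t guest_pgsize, uint64_t host_pgsize)
{
	uint64_t limit = guest_pgsize < host_pgsize ? guest_pgsize
						    : host_pgsize;
	uint64_t candidates;
	uint64_t best = 1;

	if (limit == 0)
		return 0;

	/* Keep only the supported sizes <= limit (limit is a power of 2). */
	candidates = supported & (limit | (limit - 1));
	if (candidates == 0)
		return 0;

	/* Isolate the highest remaining bit. */
	while ((candidates >>= 1) != 0)
		best <<= 1;
	return best;
}
```

So with an IOMMU supporting 4K, 64K and 16M IO pages, a guest expecting
64K pages backed by 4K host pages ends up with 4K, while the same guest
backed by 16M hugepages gets 64K.
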
> > > As an ideal, only things like the HW specific qemu vIOMMU driver
> > > should be reaching for all the special stuff.
> >
> > I'm hoping we can even avoid that, usually. With the explicitly
> > created windows model I propose above, it should be able to: qemu will
> > create the windows according to the IOVA windows the guest platform
> > expects to see and they either will or won't work on the host platform
> > IOMMU. If they do, generic maps/unmaps should be sufficient. If they
> > don't, well, the host IOMMU simply cannot emulate the vIOMMU so you're
> > out of luck anyway.
>
> It is not just P9 that has special stuff, and this whole area of PASID
> seems to be quite different on every platform
>
> If things fit very naturally and generally then maybe, but I've been
> down this road before of trying to make a general description of a
> group of very special HW. It ended in tears after 10 years when nobody
> could understand the "general" API after it was Frankenstein'd up with
> special cases for everything. Cautionary tale
>
> There is a certain appeal to having some
> 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> information like windows that can be optionally called by the viommu
> driver and it remains well defined and described.
Windows really aren't ppc specific. They're absolutely there on x86
and everything else as well - it's just that people are used to having
a window at 0..<something largish>, which you can often get away with
treating sloppily.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson