Re: [PATCH v4 3/3] iommu: dart: Add DART iommu driver

From: Sven Peter
Date: Sun Jul 25 2021 - 08:41:19 EST




On Mon, Jul 19, 2021, at 20:15, Robin Murphy wrote:
> On 2021-07-15 17:41, Sven Peter via iommu wrote:
> [...]
> >>> + u64 sw_bypass_cpu_start;
> >>> + u64 sw_bypass_dma_start;
> >>> + u64 sw_bypass_len;
> >>> +
> >>> + struct list_head streams;
> >>
> >> I'm staring to think this could just be a bitmap, in a u16 even.
> >
> > The problem is that these streams may come from two different
> > DART instances. That is required for e.g. the dwc3 controller which
> > has a weird quirk where DMA transactions go through two separate
> > DARTs with no clear pattern (e.g. some xhci control structures use the
> > first dart while other structures use the second one).
>
> Ah right, I do remember discussing that situation, but I think I
> misinterpreted dart_domain->dart representing "the DART instance" here
> to mean we weren't trying to accommodate that just yet.
>
> > Both of them need to point to the same pagetable.
> > In the device tree the node will have an entry like this:
> >
> > dwc3_0: usb@382280000{
> > ...
> > iommus = <&dwc3_0_dart_0 0>, <&dwc3_0_dart_1 1>;
> > };
> >
> > There's no need for a linked list though once I do this properly with
> > groups. I can just use an array allocated when the first device is
> > attached, which just contains apple_dart* and streamid values.
> >
> >
> >>
> >>> +
> >>> + spinlock_t lock;
> >>> +
> >>> + struct iommu_domain domain;
> >>> +};
> >>> +
> >>> +/*
> >>> + * This structure is attached to devices with dev_iommu_priv_set() on of_xlate
> >>> + * and contains a list of streams bound to this device as defined in the
> >>> + * device tree. Multiple DART instances can be attached to a single device
> >>> + * and each stream is identified by its stream id.
> >>> + * It's usually reference by a pointer called *cfg.
> >>> + *
> >>> + * A dynamic array instead of a linked list is used here since in almost
> >>> + * all cases a device will just be attached to a single stream and streams
> >>> + * are never removed after they have been added.
> >>> + *
> >>> + * @num_streams: number of streams attached
> >>> + * @streams: array of structs to identify attached streams and the device link
> >>> + * to the iommu
> >>> + */
> >>> +struct apple_dart_master_cfg {
> >>> + int num_streams;
> >>> + struct {
> >>> + struct apple_dart *dart;
> >>> + u32 sid;
> >>
> >> Can't you use the fwspec for this?
> >
> >
> > I'd be happy to use the fwspec code if that's somehow possible.
> > I'm not sure how though since I need to store both the reference to the DART
> > _and_ to the stream id. As far as I can tell the fwspec code would only allow
> > to store the stream ids.
> > (see also the previous comment regarding the dwc3 node which requires stream
> > ids from two separate DART instances)
>
> Hmm, yes, as above I was overlooking that, although there are still
> various ideas that come to mind; the question becomes whether they're
> actually worthwhile or just too-clever-for-their-own-good hacks. The
> exact format of fwspec->ids is not fixed (other than the ACPI IORT code
> having a common understanding with the Arm SMMU drivers) so in principle
> you could munge some sort of DART instance index or indeed anything, but
> if it remains cleaner to manage your own data internally then by all
> means keep doing that.

Yeah, I can think of some hacks as well (like storing a global id->apple_dart* map
or stuffing the 64bit pointer into two ints) and I've tried a few of them in the past
days but didn't like either of them.

I do like the idea to just put two (struct apple_dart *dart, u16 sidmap)
in there though which will be plenty for all current configurations.

>
> >>> + struct device_link *link;
> >>
> >> Is it necessary to use stateless links, or could you use
> >> DL_FLAG_AUTOREMOVE_SUPPLIER and not have to keep track of them manually?
> >
> > I'll just use DL_FLAG_AUTOREMOVE_SUPPLIER. No idea why I went for stateless links.
> >
> >>
> > [...]
> >>> + /* restore stream identity map */
> >>> + writel(0x03020100, dart->regs + DART_STREAM_REMAP);
> >>> + writel(0x07060504, dart->regs + DART_STREAM_REMAP + 4);
> >>> + writel(0x0b0a0908, dart->regs + DART_STREAM_REMAP + 8);
> >>> + writel(0x0f0e0d0c, dart->regs + DART_STREAM_REMAP + 12);
> >>
> >> Any hint of what the magic numbers mean?
> >
> > Yes, it's just 0,1,2,3...,0xe,0xf but I can't do 8bit writes to the bus
> > and 32 bit writes then require these slightly awkward "swapped" numbers.
> > I'll add a comment since it's not obvious at first glance.
>
> Sure, I guessed that much from "identity map" - it was more a question
> of why that means 0x03020100... rather than, say, 0x0c0d0e0f... or
> 0x76543210..., and perhaps the reason for "restoring" it in the first place.

So what this feature does is to allow the DART to take an incoming DMA stream
tagged with id i and pretend that it's actually been tagged with
readb(dart->regs + 0x80 + i) instead. That's as much as I can figure out by
poking the hardware. More details are probably only available to Apple.

Now the reason I thought I needed this was that I assumed we are handed these DARTs
in an unclean state because Apple makes use of this internally:
In their device tree they have a sid-remap property which I believe is a hack to make
their driver simpler. The dwc3 controller requires stream 0 of dartA and stream 1 of
dartB to be configured the same way. They configure dartB to remap stream 1 to stream 0
and then just mirror all MMIO writes from dartA to dartB and pretend that dwc3 only
needs a single DART.

As it actually turns out though, iBoot doesn't use the USB DARTs and we already get
them in the sane state. I can just drop this code. (And if we actually need it
for other DARTs I can also just restore those in our bootloader or add it in a
follow up).

>
> [...]
> >>> + /*
> >>> + * we can't mix and match DARTs that support bypass mode with those who don't
> >>> + * because the iova space in fake bypass mode generally has an offset
> >>> + */
> >>
> >> Erm, something doesn't sound right there... IOMMU_DOMAIN_IDENTITY should
> >> be exactly what it says, regardless of how it's implemented. If you
> >> can't provide a true identity mapping then you're probably better off
> >> not pretending to support them in the first place.
> >
> > Some background: the PCIe DART only supports a 32bit VA space but RAM
> > on these machines starts at 0x8_0000_0000. I have something like
> > dma-ranges = <0x42000000 0 0 0x8 0 0 0xffff0000>;
> > in the pcie nodes to add that offset to dma addresses.
> >
> > What I want to do here then is to setup an identity mapping with respect
> > to the DMA layer understanding of addresses encoded in bus_dma_region.
> > Now this will always just be a constant offset of 0x8_0000_0000 for
> > all M1s but I didn't want to hardcode that.
> > The code here is just there to guard against a situation where someone
> > somehow manages to attach two devices with different offsets to the same
> > domain.
>
> Urgh, *now* I think I get it - the addressing limitation WRT the
> physical memory map layout had also slipped my mind. So you describe the
> RC *as if* it had a physical bus offset, rely on iommu-dma ignoring it
> when active (which is more by luck than design - we don't expect to ever
> see a device with a real hard-wired offset upstream of an IOMMU,
> although I did initially try to support it back in the very early days),
> and otherwise statically program a translation such that anyone else who
> *does* respect bus_dma_regions finds things work as expected.

Yes, exactly. It's not very nice but it works...

>
> That actually seems like an even stronger argument for having the
> fake-bypass table belong to the DART rather than the domain, and at that
> point you shouldn't even need the mismatch restriction, since as long as
> you haven't described the fake offset for any devices who *can* achieve
> real bypass, then "attach to an identity domain" simply comes down to
> doing the appropriate thing for each individual stream, regardless of
> whether it's the same nominal identity domain that another device is
> using or a distinct one (it's highly unlikely that two groups would ever
> get attached to one identity domain rather than simply having their own
> anyway, but it is technically possible).
>

Agreed. That sounds a lot nicer actually.


> > If that's not how the abstraction is supposed to work and/or too big of a hack
> > I'll just remove the software bypass mode altogether.
> > PCIe won't work on 4k kernels then but the only people using this so far
> > build their own kernels with patches either way and won't complain.
> > And by the time Linux will actually be useful for "normal" setups
> > the dma-iommu layer can hopefully just handle a larger page granularity.
>
> It's certainly... "creative", and TBH I don't hate it (in a "play the
> hand you've been given" kind of way), but the one significant downside
> is that if the DART driver isn't loaded for any reason, PCI DMA will
> look like it should be usable but then just silently (or not so
> silently) fail.

Good point!

>
> FWIW if you do want to keep the option open, I'd be inclined to have the
> DT just give an "honest" description of just the 32-bit limitation, then
> have the DART driver's .probe_device sneakily modify the bus_dma_region
> to match the relevant fake-bypass table as appropriate. It's possible
> other folks might hate that even more though :D

I've given that one a try and I kinda like it so far :D
I'll keep it for v5 and just drop it in case someone complains.

>
> >>> + if (WARN_ON(domain->type == IOMMU_DOMAIN_IDENTITY &&
> >>> + (domain->dart->supports_bypass != dart->supports_bypass)))
> >>> + return -EINVAL;
> >>> +
> >>> + list_for_each_entry(stream, &domain->streams, stream_head) {
> >>> + if (stream->dart == dart && stream->sid == sid) {
> >>> + stream->num_devices++;
> >>> + return 0;
> >>> + }
> >>> + }
> >>> +
> >>> + spin_lock_irqsave(&dart->lock, flags);
> >>> +
> >>> + if (WARN_ON(dart->used_sids & BIT(sid))) {
> >>> + ret = -EINVAL;
> >>> + goto error;
> >>> + }
> >>> +
> >>> + stream = kzalloc(sizeof(*stream), GFP_ATOMIC);
> >>> + if (!stream) {
> >>> + ret = -ENOMEM;
> >>> + goto error;
> >>> + }
> >>
> >> Couldn't you do this outside the lock? (If, calling back to other
> >> comments, it can't get refactored out of existence anyway)
> >
> > Probably, but I'll first see if I can just refactor it away.
>
> Actually I missed that we're already under dart_domain->lock at this
> point anyway, so it's not going to make much difference, but it does
> mean that the spin_lock_irqsave() above could just be spin_lock(),
> unless it's possible to relax the domain locking a bit such that we
> don't have to do the whole domain init with IRQs masked.

I can relax the locking quite a bit.
Right now, I only need a spinlock around the TLB flush MMIO writes
and a mutex to protect domain initialization.

>
> [...]
> >>> +static struct iommu_domain *apple_dart_domain_alloc(unsigned int type)
> >>> +{
> >>> + struct apple_dart_domain *dart_domain;
> >>> +
> >>> + if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED &&
> >>> + type != IOMMU_DOMAIN_IDENTITY && type != IOMMU_DOMAIN_BLOCKED)
> >>> + return NULL;
> >>
> >> I want to say there's not much point in that, but then I realise I've
> >> spent the last couple of days writing patches to add a new domain type :)
> >
> > Hah! Just because I'm curious: What is that new domain type going to be? :)
>
> Splitting IOMMU_DOMAIN_DMA into two to replace iommu_dma_strict being an
> orthogonal thing.
>
> [...]
> >>> +static int apple_dart_of_xlate(struct device *dev, struct of_phandle_args *args)
> >>> +{
> >>> + struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
> >>> + struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
> >>> + unsigned int num_streams = cfg ? cfg->num_streams : 0;
> >>> + struct apple_dart_master_cfg *cfg_new;
> >>> + struct apple_dart *dart = platform_get_drvdata(iommu_pdev);
> >>> +
> >>> + if (args->args_count != 1)
> >>> + return -EINVAL;
> >>> +
> >>> + cfg_new = krealloc(cfg, struct_size(cfg, streams, num_streams + 1),
> >>> + GFP_KERNEL);
> >>> + if (!cfg_new)
> >>> + return -ENOMEM;
> >>> +
> >>> + cfg = cfg_new;
> >>> + dev_iommu_priv_set(dev, cfg);
> >>> +
> >>> + cfg->num_streams = num_streams;
> >>> + cfg->streams[cfg->num_streams].dart = dart;
> >>> + cfg->streams[cfg->num_streams].sid = args->args[0];
> >>> + cfg->num_streams++;
> >>
> >> Yeah, this is way too reminiscent of the fwspec code for comfort. Even
> >> if you can't use autoremove links for some reason, an array of 16
> >> device_link pointers hung off apple_dart still wins over these little
> >> pointer-heavy structures if you need more than a few of them.
> >
> > I can get rid of the links, but I'll still need some way to store
> > both the apple_dart and the sid here. Like mentioned above, I'll
> > be happy to reuse the fwspec code but I don't see how yet.
>
> As before, if you can fit in some kind of DART instance identifier which
> isn't impractical to unpack than it makes sense to use the fwspec since
> it's already there. However if you still need to allocate something
> per-device rather than just stashing an existing pointer in iommu_priv,
> then you may as well keep everything together there. If the worst known
> case could still fit in just two DART pointers and two 64-bit bitmaps,
> I'd be inclined to just have that as a fixed structure and save all the
> extra bother - you're not cross-architecture like the fwspec code, and
> arm64's minimum kmalloc granularity has just gone back up to 128 bytes
> (but even at 64 bytes you'd have had plenty of room).

That's a very good point, I somehow tried to make this part as general
as possible and didn't realize that this only has to work on essentially
one SoC for now. I also don't expect Apple to require more than two
DARTs for a single master in the future.

I've tried the fixed structure now and I really like it so far.

>
> [...]
> >>> +static int apple_dart_remove(struct platform_device *pdev)
> >>> +{
> >>> + struct apple_dart *dart = platform_get_drvdata(pdev);
> >>> +
> >>> + devm_free_irq(dart->dev, dart->irq, dart);
> >>> +
> >>> + iommu_device_unregister(&dart->iommu);
> >>> + iommu_device_sysfs_remove(&dart->iommu);
> >>> +
> >>> + clk_bulk_disable(dart->num_clks, dart->clks);
> >>> + clk_bulk_unprepare(dart->num_clks, dart->clks);
> >>
> >> Ditto.
> >>
> >> And again the bus ops are still installed - that'll get really fun if
> >> this is a module unload...
> >
> > Ugh, yeah. I'll fix that as well. I'll have to see how to make this work
> > correctly with multiple DART instances. I guess I should only remove the
> > bus ops once the last one is removed. Now that I think about it, this
> > could also get tricky in the cleanup paths of apple_dart_probe.
> >
> > Maybe just add a module_init that sets up the bus ops when it finds at
> > least one DART node and module_exit to tear them down again?
>
> Actually by this point it was late and I wasn't thinking as clearly as I
> could have been, apologies ;)
>
> I believe a module unload is in fact the *only* time you should expect
> to see .remove called - you want to set .suppress_bind_attrs in your
> driver data because there's basically no way to prevent manual unbinding
> from blowing up - so it should be fine to unconditionally clear the ops
> at this point (being removed means you must have successfully probed, so
> any ops must be yours).

Makes sense, thanks!

I'll let my current version simmer for a bit and wait until it's been
tested by a few people and will send it in a week or so then!


Best,

Sven