Re: [PATCH 00/21] SMMU enablement for NXP LS1043A and LS1046A

From: Robin Murphy
Date: Thu Sep 20 2018 - 07:49:48 EST

On 20/09/18 11:38, Laurentiu Tudor wrote:

On 19.09.2018 17:37, Robin Murphy wrote:
On 19/09/18 15:18, Laurentiu Tudor wrote:
Hi Robin,

On 19.09.2018 16:25, Robin Murphy wrote:
Hi Laurentiu,

On 19/09/18 13:35, laurentiu.tudor@xxxxxxx wrote:
From: Laurentiu Tudor <laurentiu.tudor@xxxxxxx>

This patch series adds SMMU support for NXP LS1043A and LS1046A chips
and consists mostly in important driver fixes and the required device
tree updates. It touches several subsystems and consists of three main
ÂÂ - changes in soc/drivers/fsl/qbman drivers adding iommu mapping of
ÂÂÂÂ reserved memory areas, fixes and defered probe support
ÂÂ - changes in drivers/net/ethernet/freescale/dpaa_eth drivers
ÂÂÂÂ consisting in misc dma mapping related fixes and probe ordering
ÂÂ - addition of the actual arm smmu device tree node together with
ÂÂÂÂ various adjustments to the device trees

Performance impact

ÂÂÂÂÂ Running iperf benchmarks in a back-to-back setup (both sides
ÂÂÂÂÂ having smmu enabled) on a 10GBps port show an important
ÂÂÂÂÂ networking performance degradation of around %40 (9.48Gbps
ÂÂÂÂÂ linerate vs 5.45Gbps). If you need performance but without
ÂÂÂÂÂ SMMU support you can use "iommu.passthrough=1" to disable

I should have said before - thanks for the numbers there as well. Always good to add another datapoint to my collection. If you're interested I've added SMMUv2 support to the "non-strict mode" series (of which I should be posting v8 soon), so it might be fun to see how well that works on MMU-500 in the real world.

USB issue and workaround

ÂÂÂÂÂ There's a problem with the usb controllers in these chips
ÂÂÂÂÂ generating smaller, 40-bit wide dma addresses instead of the
ÂÂÂÂÂ supported at the smmu input. So you end up in a situation
where the
ÂÂÂÂÂ smmu is mapped with 48-bit address translations, but the device
ÂÂÂÂÂ generates transactions with clipped 40-bit addresses, thus smmu
ÂÂÂÂÂ context faults are triggered. I encountered a similar
situation for
ÂÂÂÂÂ mmc that IÂ managed to fix in software [1] however for USB I
did not
ÂÂÂÂÂ find a proper place in the code to add a similar fix. The only
ÂÂÂÂÂ workaround I found was to add this kernel parameter which
limits the
ÂÂÂÂÂ usb dma to 32-bit size: "xhci-hcd.quirks=0x800000".
ÂÂÂÂÂ This workaround if far from ideal, so any suggestions for a code
ÂÂÂÂÂ based workaround in this area would be greatly appreciated.

If you have a nominally-64-bit device with a
narrower-than-the-main-interconnect link in front of it, that should
already be fixed in 4.19-rc by bus_dma_mask picking up DT dma-ranges,
provided the interconnect hierarchy can be described appropriately (or
at least massaged sufficiently to satisfy the binding), e.g.:

/ {

 ÂÂÂÂsoc {
 Â ranges;
 Â dma-ranges = <0 0 10000 0>;

 Â dev_48bit { ... };

 Â periph_bus {
 Â ranges;
 Â dma-ranges = <0 0 100 0>;

 Â dev_40bit { ... };
 Â };

and if that fails to work as expected (except for PCI hosts where
handling dma-ranges properly still needs sorting out), please do let us
know ;)

Just to confirm, Is this [1] the change I was supposed to test?

Not quite - dma-ranges is only valid for nodes representing a bus, so
putting it directly in the USB device nodes doesn't work (FWIW that's
why PCI is broken, because the parser doesn't expect the
bus-as-leaf-node case). That's teh point of that intermediate simple-bus
node represented by "periph_bus" in my example (sorry, I should have put
compatibles in to make it clearer) - often that's actually true to life
(i.e. "soc" is something like a CCI and "periph_bus" is something like
an AXI NIC gluing a bunch of lower-bandwidth DMA masters to one of the
CCI ports) but at worst it's just a necessary evil to make the binding
happy (if it literally only represents the point-to-point link between
the device master port and interconnect slave port).

Quick update: so I adjusted to device tree according to your example and
it works so now I can get rid of that nasty kernel arg based workaround,
yey! :-)

Cool! In fact, judging by the block diagrams on the website, the "basic peripherals and interconnect" section hanging off the side of the CCI implies that probably is true to the real topology as I imagined, so it doesn't even count as a horrible hack :)

Thanks a lot, that was really helpful.

No problem. FWIW if you ever come to doing ACPI support for these SoCs, the equivalent is merely a case of setting the device memory address size limit field appropriately for all the named components.