Re: [PATCH 2/3] iommu/pci: reserve iova for PCI masters

From: Robin Murphy
Date: Fri May 05 2017 - 11:51:24 EST


On 04/05/17 19:52, Oza Oza wrote:
> On Thu, May 4, 2017 at 11:50 PM, Robin Murphy <robin.murphy@xxxxxxx> wrote:
>> On 03/05/17 05:46, Oza Pawandeep wrote:
>>> this patch reserves the iova for PCI masters.
>>> ARM64 based SOCs may have scattered memory banks.
>>> such as iproc based SOC has
>>>
>>> <0x00000000 0x80000000 0x0 0x80000000>, /* 2G @ 2G */
>>> <0x00000008 0x80000000 0x3 0x80000000>, /* 14G @ 34G */
>>> <0x00000090 0x00000000 0x4 0x00000000>, /* 16G @ 576G */
>>> <0x000000a0 0x00000000 0x4 0x00000000>; /* 16G @ 640G */
>>>
>>> but incoming PCI transcation addressing capability is limited
>>> by host bridge, for example if max incoming window capability
>>> is 512 GB, then 0x00000090 and 0x000000a0 will fall beyond it.
>>>
>>> to address this problem, iommu has to avoid allocating iovas which
>>> are reserved. which inturn does not allocate iova if it falls into hole.
>>
>> I don't necessarily disagree with doing this, as we could do with facing
>> up to the issue of discontiguous DMA ranges in particular (I too have a
>> platform with this problem), but I'm still not overly keen on pulling DT
>> specifics into this layer. More than that, though, if we are going to do
>> it, then we should do it for all devices with a restrictive
>> "dma-ranges", not just PCI ones.
>>
>
> How do you propose to do it ?
>
> my thinking is this:
> iova_reserve_pci_windows is written specific for PCI, and I am adding there.
>
> ideally
> struct pci_host_bridge should have new member:
>
> struct list_head inbound_windows; /* resource_entry */
>
> but somehow this resource have to be filled much before
> iommu_dma_init_domain happens.
> and use brdge resource directly in iova_reserve_pci_windows as it is
> already doing it for outbound memory.
>
> this will detach the DT specifics from dma-immu layer.
> let me know how this sounds.

Please check the state of the code currently queued in Joerg's tree and
in linux-next - iommu_dma_get_resv_regions() has room for
device-agnostic stuff before the if (!dev_is_pci(dev)) check.

Furthermore, with the probe-deferral changes we end up with a common
dma_configure() routine to abstract the firmware-specifics of
of_dma_configure() vs. acpi_dma_configure(), so it would make sense to
give drivers etc. a similar interface for interrogating ranges. i.e.
some common function that abstracts the difference between parsing DT
dma-ranges vs. the ACPI _DMA object, either with a list-based get/put
model or perhaps an iterator with a user-provided callback (so users
could process in-place or create their own list as necessary). Unless of
course we go all the way to making the ranges an inherent part of the
device layer like some MIPS platforms currently do.

Robin.

>>> Bug: SOC-5216
>>> Change-Id: Icbfc99a045d730be143fef427098c937b9d46353
>>> Signed-off-by: Oza Pawandeep <oza.oza@xxxxxxxxxxxx>
>>> Reviewed-on: http://gerrit-ccxsw.broadcom.net/40760
>>> Reviewed-by: vpx_checkpatch status <vpx_checkpatch@xxxxxxxxxxxx>
>>> Reviewed-by: CCXSW <ccxswbuild@xxxxxxxxxxxx>
>>> Tested-by: vpx_autobuild status <vpx_autobuild@xxxxxxxxxxxx>
>>> Tested-by: vpx_smoketest status <vpx_smoketest@xxxxxxxxxxxx>
>>> Tested-by: CCXSW <ccxswbuild@xxxxxxxxxxxx>
>>> Reviewed-by: Scott Branden <scott.branden@xxxxxxxxxxxx>
>>>
>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>> index 48d36ce..08764b0 100644
>>> --- a/drivers/iommu/dma-iommu.c
>>> +++ b/drivers/iommu/dma-iommu.c
>>> @@ -27,6 +27,7 @@
>>> #include <linux/iova.h>
>>> #include <linux/irq.h>
>>> #include <linux/mm.h>
>>> +#include <linux/of_pci.h>
>>> #include <linux/pci.h>
>>> #include <linux/scatterlist.h>
>>> #include <linux/vmalloc.h>
>>> @@ -171,8 +172,12 @@ static void iova_reserve_pci_windows(struct pci_dev *dev,
>>> struct iova_domain *iovad)
>>> {
>>> struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
>>> + struct device_node *np = bridge->dev.parent->of_node;
>>> struct resource_entry *window;
>>> unsigned long lo, hi;
>>> + int ret;
>>> + dma_addr_t tmp_dma_addr = 0, dma_addr;
>>> + LIST_HEAD(res);
>>>
>>> resource_list_for_each_entry(window, &bridge->windows) {
>>> if (resource_type(window->res) != IORESOURCE_MEM &&
>>> @@ -183,6 +188,36 @@ static void iova_reserve_pci_windows(struct pci_dev *dev,
>>> hi = iova_pfn(iovad, window->res->end - window->offset);
>>> reserve_iova(iovad, lo, hi);
>>> }
>>> +
>>> + /* PCI inbound memory reservation. */
>>> + ret = of_pci_get_dma_ranges(np, &res);
>>> + if (!ret) {
>>> + resource_list_for_each_entry(window, &res) {
>>> + struct resource *res_dma = window->res;
>>> +
>>> + dma_addr = res_dma->start - window->offset;
>>> + if (tmp_dma_addr > dma_addr) {
>>> + pr_warn("PCI: failed to reserve iovas; ranges should be sorted\n");
>>
>> I don't see anything in the DT spec about the entries having to be
>> sorted, and it's not exactly impossible to sort a list if you need it so
>> (and if I'm being really pedantic, one could still trigger this with a
>> list that *is* sorted, only by different criteria).
>>
>
> we have to sort it the way we want then. I can make it sort then.
> thanks for the suggestion.
>
>> Robin.
>>
>>> + return;
>>> + }
>>> + if (tmp_dma_addr != dma_addr) {
>>> + lo = iova_pfn(iovad, tmp_dma_addr);
>>> + hi = iova_pfn(iovad, dma_addr - 1);
>>> + reserve_iova(iovad, lo, hi);
>>> + }
>>> + tmp_dma_addr = window->res->end - window->offset;
>>> + }
>>> + /*
>>> + * the last dma-range should honour based on the
>>> + * 32/64-bit dma addresses.
>>> + */
>>> + if (tmp_dma_addr < DMA_BIT_MASK(sizeof(dma_addr_t) * 8)) {
>>> + lo = iova_pfn(iovad, tmp_dma_addr);
>>> + hi = iova_pfn(iovad,
>>> + DMA_BIT_MASK(sizeof(dma_addr_t) * 8) - 1);
>>> + reserve_iova(iovad, lo, hi);
>>> + }
>>> + }
>>> }
>>>
>>> /**
>>>
>>