Re: iommu_iova leak

From: Chris Boot
Date: Sat Sep 17 2011 - 07:58:06 EST


On 17 Sep 2011, at 11:45, Woodhouse, David wrote:
> On Fri, 2011-09-16 at 13:43 +0100, Chris Boot wrote:
>> In the very short term the number is up and down by a few hundred
>> objects but the general trend is constantly upwards. After about 5 days'
>> uptime I have some very serious IO slowdowns (narrowed down by a friend
>> to SCSI command queueing) with a lot of time spent in
>> alloc_iova() and rb_prev() according to 'perf top'. Eventually these
>> translate into softlockups and the machine becomes almost unusable.
>
> If you're seeing it spend ages in rb_prev(), that implies that the
> mappings are still *active* and in the rbtree, rather than just that
> the iommu_iova data structure has been leaked.
>
> I suppose it's vaguely possible that we're leaking them in such a way
> that they remain on the rbtree, perhaps if the deferred unmap is never
> actually happening... but I think it's a whole lot more likely that the
> PCI driver is just never bothering to unmap the pages it maps.
>
> If you boot with 'intel_iommu=strict' that will avoid the deferred unmap
> which is the only likely culprit in the IOMMU code...


Booting with intel_iommu=on,strict still shows the iommu_iova object count steadily increasing, so I don't think it's that.
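
For anyone following along, the slow path David is describing is (as I understand it) the IOVA allocator's backwards walk over the rbtree of live mappings. The snippet below is only a rough sketch of that walk -- the function name and the gap check are simplified, it is not the actual code from drivers/iommu/iova.c -- but it shows why allocation cost grows with the number of outstanding mappings: every live iommu_iova is another node the walk may have to visit.

/*
 * Rough sketch only -- NOT the real __alloc_and_insert_iova_range()
 * from drivers/iommu/iova.c (locking and limit handling omitted).
 * The point: the allocator walks backwards through the rbtree of
 * currently-allocated IOVA ranges with rb_prev() looking for a free
 * gap, so with N live mappings it can touch O(N) nodes.  That is the
 * rb_prev()/alloc_iova() time 'perf top' is showing.
 */
#include <linux/iova.h>
#include <linux/rbtree.h>

static unsigned long sketch_find_gap(struct iova_domain *iovad,
				     unsigned long size)
{
	struct rb_node *node = rb_last(&iovad->rbroot);

	/* Walk from the highest allocated range downwards. */
	while (node) {
		struct iova *curr = rb_entry(node, struct iova, node);
		struct rb_node *prev = rb_prev(node);	/* the hot call */

		if (prev) {
			struct iova *below = rb_entry(prev, struct iova, node);

			/* Enough free pfns between the two live ranges? */
			if (curr->pfn_lo - below->pfn_hi - 1 >= size)
				return curr->pfn_lo - size;
		}
		node = prev;
	}
	return 0;	/* no usable gap found */
}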

I've bodged the following patch to see if it catches anything obvious. We'll see if anything useful comes of it.

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index c621c98..aebbd56 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2724,6 +2724,7 @@ static dma_addr_t __intel_map_single(struct device *hwdev, phys_addr_t paddr,
 	int ret;
 	struct intel_iommu *iommu;
 	unsigned long paddr_pfn = paddr >> PAGE_SHIFT;
+	int dma_map_count;
 
 	BUG_ON(dir == DMA_NONE);
 
@@ -2761,6 +2762,9 @@ static dma_addr_t __intel_map_single(struct device *hwdev, phys_addr_t paddr,
 	if (ret)
 		goto error;
 
+	dma_map_count = atomic_inc_return(&pdev->dma_map_count);
+	WARN_ON((dma_map_count > 2000) && !(dma_map_count % 1000));
+
 	/* it's a non-present to present mapping. Only flush if caching mode */
 	if (cap_caching_mode(iommu->cap))
 		iommu_flush_iotlb_psi(iommu, domain->id, mm_to_dma_pfn(iova->pfn_lo), size, 1);
@@ -2892,6 +2896,7 @@ static void intel_unmap_page(struct device *dev, dma_addr_t dev_addr,
 
 	pr_debug("Device %s unmapping: pfn %lx-%lx\n",
 		 pci_name(pdev), start_pfn, last_pfn);
+	atomic_dec(&pdev->dma_map_count);
 
 	/* clear the whole page */
 	dma_pte_clear_range(domain, start_pfn, last_pfn);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index f3f94a5..cb1e86b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1108,6 +1108,7 @@ struct pci_dev *alloc_pci_dev(void)
 		return NULL;
 
 	INIT_LIST_HEAD(&dev->bus_list);
+	atomic_set(&dev->dma_map_count, 0);
 
 	return dev;
 }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8c230cb..d431f39 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -331,6 +331,7 @@ struct pci_dev {
 	int rom_attr_enabled; /* has display of the rom attribute been enabled? */
 	struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
 	struct bin_attribute *res_attr_wc[DEVICE_COUNT_RESOURCE]; /* sysfs file for WC mapping of resources */
+	atomic_t dma_map_count;
 #ifdef CONFIG_PCI_MSI
 	struct list_head msi_list;
 #endif
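
If the WARN_ON ever fires, the stack trace should point straight at whichever driver is mapping without unmapping. Failing that, something along these lines could be bolted on later to dump the per-device counters on demand (untested sketch, only using the dma_map_count field added above; it would still need wiring up to debugfs or a sysrq handler):

/*
 * Untested sketch: print each PCI device's outstanding-mapping count,
 * based on the dma_map_count field added by the patch above.  A leaking
 * driver shows up as a device whose count only ever grows.
 */
#include <linux/atomic.h>
#include <linux/pci.h>
#include <linux/printk.h>

static void dump_dma_map_counts(void)
{
	struct pci_dev *pdev = NULL;

	/* iterate over all PCI devices; pci_get_device() handles refcounts */
	for_each_pci_dev(pdev)
		pr_info("%s: %d outstanding DMA mappings\n",
			pci_name(pdev), atomic_read(&pdev->dma_map_count));
}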

--
Chris Boot
bootc@xxxxxxxxx
