Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping

From: Mike Travis
Date: Wed Mar 30 2011 - 15:25:53 EST




Chris Wright wrote:
* Mike Travis (travis@xxxxxxx) wrote:
Chris Wright wrote:
* Mike Travis (travis@xxxxxxx) wrote:
When the IOMMU is being used, each request for a DMA mapping requires
the intel_iommu code to look for some space in the DMA mapping table.
For most drivers this occurs for each transfer.

When there are many outstanding DMA mappings [as seems to be the case
with the 10GigE driver], the table grows large and the search for
space becomes increasingly time consuming. Performance for the
10GigE driver drops to about 10% of its capacity on a UV system
when the CPU count is large.
That's pretty poor. I've seen large overheads, but when that big it was
also related to issues in the 10G driver. Do you have profile data
showing this as the hotspot?
Here's one from our internal bug report:

Here is a profile from a run with iommu=on iommu=pt (no forcedac)

OK, I was actually interested in the !pt case. But this is useful
still. The iova lookup being distinct from the identity_mapping() case.

I can get that as well, but having every device using maps caused its
own set of problems (hundreds of dma maps). Here's a list of devices
on the system under test. You can see that even 'minor' glitches can
get magnified when there are so many...

Blade Location NASID PCI Address X Display Device
----------------------------------------------------------------------
0 r001i01b00 0 0000:01:00.0 - Intel 82576 Gigabit Network Connection
. . . 0000:01:00.1 - Intel 82576 Gigabit Network Connection
. . . 0000:04:00.0 - LSI SAS1064ET Fusion-MPT SAS
. . . 0000:05:00.0 - Matrox MGA G200e
2 r001i01b02 4 0001:02:00.0 - Mellanox MT26428 InfiniBand
3 r001i01b03 6 0002:02:00.0 - Mellanox MT26428 InfiniBand
4 r001i01b04 8 0003:02:00.0 - Mellanox MT26428 InfiniBand
11 r001i01b11 22 0007:02:00.0 - Mellanox MT26428 InfiniBand
13 r001i01b13 26 0008:02:00.0 - Mellanox MT26428 InfiniBand
15 r001i01b15 30 0009:07:00.0 :0.0 nVidia GF100 [Tesla S2050]
. . . 0009:08:00.0 :1.1 nVidia GF100 [Tesla S2050]
18 r001i23b02 36 000b:02:00.0 - Mellanox MT26428 InfiniBand
20 r001i23b04 40 000c:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 000c:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 000c:04:00.0 - Mellanox MT26428 InfiniBand
23 r001i23b07 46 000d:07:00.0 - nVidia GF100 [Tesla S2050]
. . . 000d:08:00.0 - nVidia GF100 [Tesla S2050]
25 r001i23b09 50 000e:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 000e:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 000e:04:00.0 - Mellanox MT26428 InfiniBand
26 r001i23b10 52 000f:02:00.0 - Mellanox MT26428 InfiniBand
27 r001i23b11 54 0010:02:00.0 - Mellanox MT26428 InfiniBand
29 r001i23b13 58 0011:02:00.0 - Mellanox MT26428 InfiniBand
31 r001i23b15 62 0012:02:00.0 - Mellanox MT26428 InfiniBand
34 r002i01b02 68 0013:01:00.0 - Mellanox MT26428 InfiniBand
35 r002i01b03 70 0014:02:00.0 - Mellanox MT26428 InfiniBand
36 r002i01b04 72 0015:01:00.0 - Mellanox MT26428 InfiniBand
41 r002i01b09 82 0018:07:00.0 - nVidia GF100 [Tesla S2050]
. . . 0018:08:00.0 - nVidia GF100 [Tesla S2050]
43 r002i01b11 86 0019:01:00.0 - Mellanox MT26428 InfiniBand
45 r002i01b13 90 001a:01:00.0 - Mellanox MT26428 InfiniBand
48 r002i23b00 96 001c:07:00.0 - nVidia GF100 [Tesla S2050]
. . . 001c:08:00.0 - nVidia GF100 [Tesla S2050]
50 r002i23b02 100 001d:02:00.0 - Mellanox MT26428 InfiniBand
52 r002i23b04 104 001e:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 001e:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 001e:04:00.0 - Mellanox MT26428 InfiniBand
57 r002i23b09 114 0020:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 0020:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 0020:04:00.0 - Mellanox MT26428 InfiniBand
58 r002i23b10 116 0021:02:00.0 - Mellanox MT26428 InfiniBand
59 r002i23b11 118 0022:02:00.0 - Mellanox MT26428 InfiniBand
61 r002i23b13 122 0023:02:00.0 - Mellanox MT26428 InfiniBand
63 r002i23b15 126 0024:02:00.0 - Mellanox MT26428 InfiniBand


uv48-sys was receiving and uv-debug sending.
ksoftirqd/640 was running at approx. 100% cpu utilization.
I had pinned the nttcp process on uv48-sys to cpu 64.

# Samples: 1255641
#
# Overhead Command Shared Object Symbol
# ........ ............. ............. ......
#
50.27% ksoftirqd/640 [kernel] [k] _spin_lock
27.43% ksoftirqd/640 [kernel] [k] iommu_no_mapping

...
0.48% ksoftirqd/640 [kernel] [k] iommu_should_identity_map
0.45% ksoftirqd/640 [kernel] [k] ixgbe_alloc_rx_buffers [ixgbe]

Note, ixgbe has had rx dma mapping issues (that's why I wondered what
was causing the massive slowdown under !pt mode).

I think since this profile run, the network guys updated the ixgbe
driver with a later version. (I don't know the outcome of that test.)


<snip>
I tracked this time down to identity_mapping() in this loop:

	list_for_each_entry(info, &si_domain->devices, link)
		if (info->dev == pdev)
			return 1;

I didn't get the exact count, but there were approx 11,000 PCI devices
on this system. And this function was called for every page request
in each DMA request.

Right, so this is the list traversal (and wow, a lot of PCI devices).

Most of the PCI devices were the 45 on each of 256 Nehalem sockets.
Also, there's a ton of bridges as well.

Did you try a smarter data structure? (While there's room for another
bit in pci_dev, the bit is more about iommu implementation details than
anything at the pci level).

Or the domain_dev_info is cached in the archdata of device struct.
You should be able to just reference that directly.

Didn't think it through completely, but perhaps something as simple as:

return pdev->dev.archdata.iommu == si_domain;

I can try this, thanks!
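
For reference, the two approaches discussed above (walking the si_domain
device list versus checking the pointer cached on the device) can be
compared in a small user-space sketch. Everything here is a hypothetical
stand-in and not the kernel's actual definitions: struct device,
struct dev_info, struct domain and attach_all() are invented for
illustration, and the sketch assumes the cached pointer refers to a
per-device info structure that in turn points at its domain, so the
comparison goes through info->domain rather than comparing the cached
pointer to si_domain directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; NOT the kernel's real structures. */
struct device;
struct domain { int id; };

struct dev_info {
	struct dev_info *next;   /* simplified singly linked list */
	struct device   *dev;    /* device this entry describes */
	struct domain   *domain; /* domain the device is attached to */
};

struct device {
	struct dev_info *iommu;  /* models the pointer cached in archdata */
};

/* Attach n devices to dom and chain their info entries into one list. */
struct dev_info *attach_all(struct dev_info *infos, struct device *devs,
			    size_t n, struct domain *dom)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		infos[i].dev = &devs[i];
		infos[i].domain = dom;
		infos[i].next = (i + 1 < n) ? &infos[i + 1] : NULL;
		devs[i].iommu = &infos[i];
	}
	return &infos[0];
}

/* List walk: cost grows with the number of devices in the domain. */
int identity_mapping_walk(const struct dev_info *head,
			  const struct device *dev)
{
	for (const struct dev_info *info = head; info; info = info->next)
		if (info->dev == dev)
			return 1;
	return 0;
}

/* Cached lookup: one pointer dereference and one comparison. */
int identity_mapping_cached(const struct device *dev,
			    const struct domain *si_domain)
{
	const struct dev_info *info = dev->iommu;

	return info != NULL && info->domain == si_domain;
}
```

With roughly 11,000 entries on the list, the walk may touch thousands of
nodes per call, while the cached check does constant work regardless of
the device count. Whether the cached pointer is always valid at the
point of the call is something the real code would need to confirm.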


thanks,
-chris