On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:Finally, with my email filters fixed, I can see this email... :)On 05/15/2012 05:09 PM, Alex Williamson wrote:On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:+1; there is a global in the PCI code, pci_acs_enable,On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:
On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:
In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed. Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases. For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions. Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain. This information can then be used in grouping to create
fully isolated groups. ACS chain logic extracted from libvirt.
The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
Right, maybe this should be:
struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
and a function pci_enable_acs(), which the above name certainly
confuses. I recommend pci_find_top_acs_bridge()
would be most descriptive.
ok.
Yep, the new API I'm working with is:
bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
struct pci_dev *end, u16 acs_flags);
In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:detail: PCIe up/downstream routing is really done by an internal switch;I'm not sure what "a complete PCI ACS chain" means.
The function starts from "dev" and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.
Any intermediate switch between an endpoint and the root bus can
redirect a dma access without iommu translation,
Is this "redirection" just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary? For example,
here:
^
|
+--------+-------+
| |
+------+-----+ +-----++-----+
| Downstream | | Downstream |
| Port | | Port |
| 06:05.0 | | 06:06.0 |
+------+-----+ +------+-----+
| |
+----v----+ +----v----+
| Endpoint| | Endpoint|
| 07:00.0 | | 08:00.0 |
+---------+ +---------+
that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0. This is plain old PCI,
nothing specific to PCIe.
Right, I think the main PCI difference is the point-to-point nature of
PCIe vs legacy PCI bus. On a legacy PCI bus there's no way to prevent
devices talking to each other, but on PCIe the transaction makes a
U-turn at some point and heads out another downstream port. ACS allows
us to prevent that from happening.
ACS forces the legacy, PCI base-limit address routing and *forces*
the switch to always route the transaction from a downstream port
to the upstream port.
+1I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.
Yep, that's my meager understanding as well.
Correct. PCIe spec says roots must support ACS. I believe all theso we're looking for
the furthest upstream device for which acs is enabled all the way up to
the root bus.
Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.
Yes, modulo this comment in libvirt source:
/* if we have no parent, and this is the root bus, ACS doesn't come
* into play since devices on the root bus can't P2P without going
* through the root IOMMU.
*/
root bridges that have an IOMMU have ACS wired in/on.
Would you mind looking for the paragraph that says this? I'd rather
code this into the iommu driver callers than core PCI code if this is
just a platform standard.
ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream portSo we assume that a redirect at the point of the iommu will factor insee comment above wrt root ports that have IOMMUs in them.
iommu translation.
If so, I think you're trying to find out the closest upstream device X
such that everything leading to X has ACS enabled. Every device below
X can DMA freely to other devices below X, so they would all have to
be in the same isolated group.
Yes
I tried to work through some examples to develop some intuition about this:
(inserting fixed url)http://www.asciiflow.com/#3736558963405980039
pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
if 00:00.0 is PCIe or if RP has ACS?))
Hmm, the latter is the assumption above. For the former, I think
libvirt was probably assuming that PCI devices must have a PCIe device
upstream from them because x86 doesn't have assignment friendly IOMMUs
except on PCIe. I'll need to work on making that more generic.
pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
PCIe; seems wrong)
Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
input devices, so this was passing for me. I'll need to incorporate
that generically.
pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
doesn't have ACS)
Yeah, let me validate the libvirt assumption. I see ACS on my root
port, so maybe they're just assuming it's always enabled or that the
precedence favors IOMMU translation. I'm also starting to think that we
might want "from" and "to" struct pci_dev parameters to make it more
flexible where the iommu lives in the system.
Except it really seems to be platform convention where the IOMMU lives.
The DMAR for VT-d describes which devices and hierarchies a DRHD is used
for and from that we can make assumptions about where it physically
lives. AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
function on the root bus. For now I'm just allowing
pci_acs_path_enabled to take NULL for and end, which means "up to the
root bus".
Well, in digging into the ACS-support in Root port question above,
correct. ACS is a *bridge* property.pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
a bridge; seems wrong if 04:00 is a multi-function device)
AIUI, ACS is not an endpoint property, so this is what should happen. I
don't think multifunction plays a role other than how much do we trust
the implementation to not allow back channels between functions (the
answer should probably be not at all).
The unknown wrt multifunction devices is that such devices *could* be implemented
by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
btwn the functions within a device.
Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
force ACS. So, one has to ask the hw vendors if such a hidden device exists
in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
and allow parent bridge re-route it back down if peer-to-peer is desired.
Debate exists whether multifunction devices are 'secure' b/c of this unknown.
Maybe a PCIe (min., SRIOV) spec change is needed in this area to
determine this status about a device (via pci cfg/cap space).
Well, there is actually a section of the ACS part of the spec
identifying valid flags for multifunction devices. Secretly I'd like to
use this as justification for blacklisting all multifunction devices
that don't explicitly support ACS, but that makes for pretty course
granularity. For instance, all these devices end up in a group:
+-14.0 ATI Technologies Inc SBx00 SMBus Controller
+-14.2 ATI Technologies Inc SBx00 Azalia (Intel HDA)
+-14.3 ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
+-14.4-[05]----0e.0 VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
And these in another:
+-15.0-[06]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
+-15.1-[07]----00.0 Etron Technology, Inc. EJ168 USB 3.0 Host Controller
+-15.2-[08]--
+-15.3-[09]--
00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
Am I misinterpreting the spec or is this the correct, if strict,
interpretation?
Alex