On 2023-09-26 13:32, Christian König wrote:
Am 26.09.23 um 06:33 schrieb Kelly Devilliv:
On 2023-09-26 01:58, Christian König wrote:
Am 25.09.23 um 16:17 schrieb Kelly Devilliv:Thanks Christian. That's true.
On 2023-09-25 19:16, Robin Murphy wrote:That's not an issue, but expected behavior.
On 2023-09-25 04:59, Kelly Devilliv wrote:Thanks Robin.
Dear all,Not really. Other use-cases for dma_map_resource() include DMA
I am working on an ARM-V8 server with two gpu cards on it.
Recently, I need to test pcie peer to peer communication between
the two gpu cards, but the throughput is only 4GB/s.
After I explored the gpu's kernel mode driver, I found it was
using the dma_map_resource() API to map the peer device's MMIO
space. The arm iommu driver then will hardcode a 'IOMMU_MMIO' prot in the later dma map:
static dma_addr_t iommu_dma_map_resource(struct device
*dev, phys_addr_t phys,
size_t size, enum
dma_data_direction dir, unsigned long attrs)
{
return __iommu_dma_map(dev, phys, size,
dma_info_to_prot(dir,
false,
attrs) | IOMMU_MMIO,
dma_get_mask(dev));
}
And that will finally set the 'ARM_LPAE_PTE_MEMATTR_DEV' attribute
in PTE, which may have a negative impact on the performance of the
pcie peer to peer transactions.
/*
* Note that this logic is structured to accommodate Mali LPAE
* having stage-1-like attributes but stage-2-like permissions.
*/
if (data->iop.fmt == ARM_64_LPAE_S2 ||
data->iop.fmt == ARM_32_LPAE_S2) {
if (prot & IOMMU_MMIO)
pte |= ARM_LPAE_PTE_MEMATTR_DEV;
else if (prot & IOMMU_CACHE)
pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
else
pte |= ARM_LPAE_PTE_MEMATTR_NC;
} else {
if (prot & IOMMU_MMIO)
pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
else if (prot & IOMMU_CACHE)
pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
}
I tried to remove the 'IOMMU_MMIO' prot in the dma_map_resource()
API and re-compile the linux kernel, the throughput then can be up
to 28GB/s.
Is there an elegant way to solve this issue without modifying the linux kernel?
e.g., a substitution of dma_map_resource() API?
offload engines accessing FIFO registers, where allowing
reordering, write-gathering, etc. would be a terrible idea. Thus it
needs to assume a "safe" MMIO memory type, which on Arm means Device-nGnRE.
However, the "proper" PCI peer-to-peer support under
CONFIG_PCI_P2PDMA ended up moving away from the dma_map_resource()
approach anyway, and allows this kind of device memory to be
treated more like regular memory (via
ZONE_DEVICE) rather than arbitrary MMIO resources, so your best bet
would be to get the GPU driver converted over to using that.
So your suggestion is we'd better work out a new implementation just
as what it does under CONFIG_PCI_P2PDMA instead of just using the
dma_map_resource() API?
I have explored the GPU drivers from AMD, Nvidia and habanalabs,
e.g., and found they all using the dma_map_resource() API to map the
peer device's bar address.
If so, is it possible to be a common performance issue in PCI
peer-to-peer scenario?
When you enable IOMMU every transaction needs to go through the root
complex for address translation and you completely lose the
performance benefit of PCIe P2P.
This is a hardware limitation and not really related toBut when I removed the 'IOMMU_MMIO' prot in dma_map_resource(), the
dma_map_resource() in any way.
performace was significantly improved (from 4GB/s to 28GB/s), which was
almost the same as what it can be when IOMMU disabled. So I guess in my common pci topology,
what really matters may not be whether IOMMU is enabled or not, but in fact the attributes in dma mapping or ARM PTE does.
The key point is that nobody really supports that configuration, so you probably
will find nobody looking into it.
BTW: ARM isn't really supported as a platform for amdgpu either. E.g. we have
seen tons of boards which implement the PCIe standard incorrectly, if you run
into any trouble with that you are pretty much on your own.
Thanks Christian. I am going to disable IOMMU or do some tricks in PCI peer-to-peer scenario.
I don't know if there is a way to make the memory attributes more configurable in order to be distinguished
from the "safe" MMIO memory type, which on Arm means Device-nGnRE as Robin said.
Well we would need to extend dma_map_resource() to include some use case
so that the mapping attributes don't need to be guessed.
Hi Robin,
Is there any chance to extend the dma_map_resource() API as discussed above?