Re: [PATCH 03/25] dma-direct: take dma-ranges/offsets into account in resource mapping
From: Serge Semin
Date: Sun Apr 24 2022 - 17:47:07 EST
On Thu, Apr 21, 2022 at 09:51:31PM +0100, Robin Murphy wrote:
> On 2022-04-21 18:35, Serge Semin wrote:
> > On Thu, Apr 21, 2022 at 04:45:36PM +0200, Christoph Hellwig wrote:
> > > On Wed, Apr 20, 2022 at 11:55:38AM +0300, Serge Semin wrote:
> > > > On Wed, Apr 20, 2022 at 10:47:46AM +0200, Christoph Hellwig wrote:
> > > > > I can't really comment on the dma-ranges exlcusion for P2P mappings,
> > > > > as that predates my involvedment, however:
> > > >
> > > > My example wasn't specific to the PCIe P2P transfers, but about PCIe
> > > > devices reaching some platform devices over the system interconnect
> > > > bus.
> > >
> > > So strike PCIe, but this our definition of Peer to Peer accesses.
> > >
> > > > What if I get to have a physical address of a platform device and want
> > > > have that device being accessed by a PCIe peripheral device? The
> > > > dma_map_resource() seemed very much suitable for that. But considering
> > > > what you say it isn't.
> > >
> > > dma_map_resource is the right thing for that. But the physical address
> > > of MMIO ranges in the platform device should not have struct pages
> > > allocated for it, and thus the other dma_map_* APIs should not work on
> > > it to start with.
> > The problem is that the dma_map_resource() won't work for that, but
> > presumably the dma_map_sg()-like methods will (after some hacking with
> > the phys address, but anyway). Consider the system diagram in my
> > previous email. Here is what I would do to initialize a DMA
> > transaction between a platform device and a PCIe peripheral device:
> > 1) struct resource *rsc = platform_get_resource(plat_dev, IORESOURCE_MEM, 0);
> > 2) dma_addr_t dar = dma_map_resource(&pci_dev->dev, rsc->start, rsc->end - rsc->start + 1,
> > DMA_FROM_DEVICE, 0);
> > 3) dma_addr_t sar;
> > void *tmp = dma_alloc_coherent(&pci_dev->dev, PAGE_SIZE, &sar, GFP_KERNEL);
> > memset(tmp, 0xaa, PAGE_SIZE);
> > 4) PCIe device: DMA.DAR=dar, DMA.SAR=sar. RUN.
> > If there is no dma-ranges specified in the PCIe Host controller
> > DT-node, the PCIe peripheral devices will see the rest of the system
> > memory as is (no offsets and remappings). But if there is dma-ranges
> > with some specific system settings it may affect the PCIe MRd/MWr TLPs
> > address translation including the addresses targeted to the MMIO
> > space. In that case the mapping performed on step 2) will return a
> > wrong DMA-address since the corresponding dma_direct_map_resource()
> > just returns the passed physical address missing the
> > 'pci_dev->dma_range_map'-based mapping performed in
> > translate_phys_to_dma().
> > Note the mapping on step 3) works correctly because it calls the
> > translate_phys_to_dma() of the direct DMA interface thus taking the
> > PCie dma-ranges into account.
> > To sum up as I see it either restricting dma_map_resource() to map
> > just the intra-bus addresses was wrong or there must be some
> > additional mapping infrastructure for the denoted systems. Though I
> > don't see a way the dma_map_resource() could be fixed to be suitable
> > for each considered cases.
> FWIW the current semantics of dma_map_resource() are basically just to
> insert IOMMU awareness where dmaengine drivers were previously just casting
> phys_addr_t to dma_addr_t (or u32, or whatever else they put into their
> descriptor/register/etc.) IIRC there was a bit of a question whether it
> really belonged in the DMA API at all, since it's not really a "DMA"
> operation in the conventional sense, and convenience was the only real
> deciding argument. The relevant drivers at the time were not taking
> dma_pfn_offset into account when consuming physical addresses directly, so
> the new API didn't either.
> That's just how things got to where they are today.
I see. Thanks for the clarification. Right, IOMMU is the only reason
to have the current dma_map_resource() implementation.
> Once again, I'm not
> saying that what we have now is necessarily right, or that your change is
> necessarily wrong, I just really want to understand specifically *why* you
> need to make it, so we can evaluate the risk of possible breakage either
> way. Theoretical "if"s aren't really enough.
As I already said our SoC has the next structure (obviously the
diagram is very simplified, but the gist is the same):
| DDR |
+-----+ +------+-------+ +---------+
| CPU +-+ Interconnect +-+ DEVs... |
+-----+ +-----^-+------+ +---------+
| PCI |
The PCIe peripheral devices are connected to the rest of the system
via the DW PCIe Host controller. If the controller has the inbound
iATU configured to re-map the system memory (RAM, IOMEM) in a
non-one-to-one way (using the dma-ranges DT property of the PCIe host
controller), then all the PCIe bus MRd/MWr TLP addresses will be
accordingly translated on a way to all the connected to the
interconnect slave devices including the MMIO devices. The kernel DMA
API at this moment provides the only methods to get the PCIe-bus
visible RAM addresses, while the physical addresses (for instance of
the MMIO devices) can't be correctly translated for such case. I
thought that dma_map_resource() could do the trick, but it turned out
it didn't get the dma-ranges mapping into account.
To be fully honest currently we don't really have any platform which
would have had the strong requirement of doing DMA from the PCIe
peripheral devices to the platform devices. But since the PCIe bus is
the extendable bus (cold and hot pluggable) then such requirement may
arise in the practice for instance on a platform with the PCIe NTB
device attached to the PCIe bus and configured to access the system
MMIO devices via the bridge. That part I find potentially problematic
seeing the practical usecase is unsupported just due to the incomplete
API. Moreover the dma_direct_map_resource() method semantic being
different from the rest of the direct DMA mapping methods doesn't seem
right from the usability point of view. Finally as you can see having
the dma_direct_map_resource() defined as MMIO-specific doesn't
mean that the dma_pfn_offset-based mapping isn't supposed to be taken