Need help mapping pre-reserved *cacheable* DMA buffer on Xilinx/ARM SoC (Zynq 7000)

From: Timothy Normand Miller
Date: Tue Jan 19 2016 - 13:27:54 EST

I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA
fabric that has DMA capability (on an AXI bus). We're having
performance problems accessing a DMA buffer.


We have pre-reserved at boot time a section of DRAM for use as a large
DMA buffer. We're apparently using the wrong APIs to map this buffer,
because it appears to be uncached, and the access speed is terrible.

Using it even as a bounce buffer is untenably slow. IIUC, ARM caches
are not DMA coherent, so I would really appreciate some insight on how
to do the following:

(1) Map a region of DRAM into the kernel virtual address space but
ensure that it is CACHEABLE.
(2) Ensure that mapping it into userspace doesn't also have an
undesirable effect, even if that requires we provide an mmap call by
our own driver.
(3) Explicitly invalidate a region of physical memory from the cache
hierarchy before doing a DMA, to ensure coherency.

More info:

I've been trying to do due diligence here before mailing the list.
Unfortunately, this being an ARM SoC/FPGA, there's very little
information available on this, so I have to ask the experts directly.

Since this is an SoC, a lot of stuff is hard-coded for u-boot. For
instance, the kernel and a ramdisk are loaded to specific places in
DRAM before handing control over to the kernel. We've taken advantage
of this to reserve a 64MB section of DRAM for a DMA buffer (it does
need to be that big, which is why we pre-reserve it). There isn't any
worry about conflicting memory types or the kernel stomping on this
memory, because the boot parameters tell the kernel what region of
DRAM it has control over.

Initially, we tried to map this physical address range into kernel
space using ioremap, but that appears to mark the region uncacheable,
and the access speed is horrible, even if we try to use memcpy to make
it a bounce buffer. We also use /dev/mem to map this region into
userspace, and I've timed memcpy on that mapping at around 70MB/sec.
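
For reference, a memcpy throughput figure like the one above can be
obtained with something along these lines. This is only a rough sketch,
not the exact benchmark used: the function name memcpy_mb_per_sec is
made up for illustration, "MB" is taken to mean 10^6 bytes, and the
source pointer is assumed to be whatever mapping is under test (for
example, the region mapped through /dev/mem).

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: copy `len` bytes from src to dst `reps` times and
 * return the observed throughput in MB/s (1 MB = 1e6 bytes).
 * Returns -1.0 on invalid arguments or if no elapsed time was measured.
 */
double memcpy_mb_per_sec(void *dst, const void *src, size_t len, int reps)
{
    struct timespec t0, t1;

    if (dst == NULL || src == NULL || len == 0 || reps <= 0)
        return -1.0;

    /* C11 wall-clock timestamp; adequate for a coarse measurement. */
    timespec_get(&t0, TIME_UTC);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, len);
        /* Compiler barrier so repeated copies are not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    timespec_get(&t1, TIME_UTC);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return -1.0;

    return ((double)len * (double)reps) / secs / 1e6;
}
```

Running it once with src pointing at ordinary malloc'd memory and once
with src pointing at the mapped DMA region gives a direct comparison
between cached and (apparently) uncached access.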

Based on a fair amount of searching on this topic, it appears that
although half the people out there want to use ioremap like this
(which is probably where we got the idea from), ioremap is not
supposed to be used for this purpose and that there are DMA-related
APIs that should be used instead. Unfortunately, it appears that DMA
buffer allocation is totally dynamic, and I haven't figured out how to
tell it, "here's a physical address already allocated -- use that."

One document I looked at is this one, but it's way too x86 and PC-centric:

And this question also comes up at the top of my searches, but there's
no real answer:

Looking at the standard calls, dma_set_mask_and_coherent and family
won't take a predefined address and want a device structure for PCI.
I don't have such a structure, because this is an ARM SoC without PCI.
I could manually populate such a structure, but that smells to me like
abusing the API, not using it as intended.

BTW: This is a ring buffer, where we DMA data blocks into different
offsets, but we align to cache line boundaries, so there is no risk of
false sharing.

Thank you a million for any help you can provide!

Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
Open Graphics Project