Correct use of DMA api (Some newbie questions)
From: Nikolai Zhubr
Date: Sun Jul 14 2019 - 13:01:37 EST
Hi all,
After reading some (apparently contradictory) revisions of the DMA API
references in Documentation/DMA-*.txt, some (contradictory) discussions
thereof, and even digging through the in-tree drivers in search of a
good enlightening example, I still have to ask for advice.
I'm crafting a tiny driver (or rather, a kernel-mode helper) for a very
special PCIe device. It actually works already, but performs
differently on different kernels. I'm targeting x86 (i686) only (although
preferably the driver should stay platform-neutral) and I need to
support kernels 4.9+. Due to how the device is designed and used, very
little has to be done in kernel space. The device has large internal
memory, which accumulates some measurement data, and it is capable of
transferring it to the host using DMA (with at least 32-bit address
space available). Arranging memory for DMA is pretty much the only thing
that userspace cannot reasonably do, so this needs to be in the driver.
So my currently attempted layout is as follows:
1. In the (kernel-mode) driver, allocate a large contiguous block of
physical memory to do DMA into. It will later be reused several times.
This block does not need to have a kernel-mode virtual address because
it will never be accessed from the driver directly. The block size is
typically 128M and I use CMA=256M. Currently I use dma_alloc_coherent(),
but I'm not convinced it really needs to be a strictly coherent memory,
for performance reasons, see below. Also, AFAICS on x86
dma_alloc_coherent() always creates a kernel address mapping anyway, so
maybe I'd better simply kmalloc() with subsequent dma_map_single()?
2. Upon DMA completion (from device to host), some sort of
barrier/synchronization might be necessary (to be safe WRT speculative
loads, cache, etc), like dma_cache_sync() or dma_sync_single_for_cpu(),
however the latter looks like a no-op on x86 AFAICS, and the former is
apparently flush_write_buffers(), which is not very involved either (asm
lock; nop) and does not look useful for my case. Currently, I do not
use any, and it seems to be OK, maybe by pure luck. So, is it so
trivially simple on x86 or am I just missing something horribly big here?
3. mmap this buffer for userspace. Reading from it should be as fast as
possible, therefore this block AFAICS should be cacheable (and
prefetchable and whatever else for better performance), at least from
userspace context. It is not quite clear if such properties would depend
on block allocation method (in step 1 above) or just on remapping
attributes only. Currently, for mmap I employ dma_mmap_coherent(), but
it seems also possible to use remap_pfn_range(), and also change
vm_page_prot somewhat. I've already found that e.g. pgprot_noncached()
hurts performance quite a lot, but presumably without it some DMA
barrier (step 2 above) would still be necessary?
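
For illustration, the userspace side of step 3 might look roughly like
the minimal sketch below. It assumes the driver exposes the buffer
through mmap() on a character device; the helper names
map_dma_buffer() and unmap_dma_buffer() are hypothetical and nothing
here is taken from the actual driver. The mapping is requested
read-only and shared. Whether the resulting pages are cached is decided
by the driver's mmap handler (vm_page_prot), not by these mmap() flags.

```c
#define _POSIX_C_SOURCE 200809L

#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of any real driver): map `len` bytes
 * of the DMA buffer read-only through an already opened descriptor.
 * Returns NULL on failure; errno is left as set by mmap().
 */
static const void *map_dma_buffer(int fd, size_t len)
{
	/* MAP_SHARED so that data written by the device is visible here. */
	void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical counterpart: drop the mapping created above. */
static int unmap_dma_buffer(const void *p, size_t len)
{
	return munmap((void *)p, len);
}
```

A caller would open the device node (its name is driver-specific and
not assumed here), pass the descriptor together with the buffer size
(128M in the setup described above), and read from the returned pointer
only after the driver has reported DMA completion.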
Any hints greatly appreciated,
Regards,
Nikolai