Hi Jonathan,
On Sun, Nov 21 2021 at 15:00:37 +0000, Jonathan Cameron <jic23@xxxxxxxxxx> wrote:
> On Mon, 15 Nov 2021 14:19:21 +0000
> Paul Cercueil <paul@xxxxxxxxxxxxxxx> wrote:
> > We can be certain that the input buffers will only be accessed by
> > userspace for reading, and output buffers will mostly be accessed by
> > userspace for writing.
>
> Mostly? Perhaps a little more info on why that's not 'only'.
Just like with a framebuffer, it really depends on what the application does. In most cases it will just read an input buffer sequentially, or write an output buffer sequentially. But then you get the exotic application that tries to do something like alpha blending, which means read + write. Hence "mostly".
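To illustrate what I mean (a rough sketch, not code from the patch; "out" and "src" are made-up names for an mmap'd output buffer and some source samples): a sequential fill only streams stores to the write-combine mapping, while a blend-style access has to read the buffer back for every element, and those reads get no help from the cache.

/*
 * Sketch only: "out" stands for a hypothetical mmap'd output buffer,
 * "src" for sample data sitting in normal cached memory.
 */
#include <stddef.h>
#include <stdint.h>

/* Fine on a write-combine mapping: the CPU only streams writes out. */
static void fill_sequential(uint16_t *out, const uint16_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		out[i] = src[i];
}

/*
 * Slow on a write-combine mapping: every iteration reads out[i] back,
 * and reads from WC/uncached memory bypass the cache entirely.
 */
static void blend_in_place(uint16_t *out, const uint16_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		out[i] = (uint16_t)((out[i] + src[i]) / 2);
}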
> > Therefore, it makes more sense to use only fully cached input buffers,
> > and to use the write-combine cache coherency setting for output buffers.
> > This boosts performance, as the data written to the output buffers does
> > not have to be sync'd for coherency. It will halve performance if the
> > userspace application tries to read from the output buffer, but this
> > should never happen.
> >
> > Since we don't need to sync the cache when disabling CPU access either
> > for input buffers or output buffers, the .end_cpu_access() callback can
> > be dropped completely.
> We have an odd mix of coherent and non-coherent DMA in here as you noted,
> but are you sure this is safe on all platforms?
The mix isn't safe, but using only coherent or only non-coherent should be safe, yes.
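To spell out what I mean by the two "pure" setups, something like this (just a sketch with placeholder dev/size arguments, not the actual buffer code):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/*
 * Fully coherent allocation: no cache maintenance is needed around CPU
 * accesses, the hardware keeps the CPU and device views in sync.
 */
static void *alloc_coherent_example(struct device *dev, size_t size,
				    dma_addr_t *dma)
{
	return dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
}

/*
 * Non-coherent (streaming) mapping: also safe, as long as every CPU
 * access is bracketed by the right sync calls.
 */
static void read_streaming_example(struct device *dev, dma_addr_t dma,
				   size_t size)
{
	/* The device has just written the buffer; sync before the CPU reads. */
	dma_sync_single_for_cpu(dev, dma, size, DMA_FROM_DEVICE);

	/* ... CPU reads the buffer here ... */

	/* Hand the buffer back to the device. */
	dma_sync_single_for_device(dev, dma, size, DMA_FROM_DEVICE);
}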
> > Signed-off-by: Paul Cercueil <paul@xxxxxxxxxxxxxxx>
>
> Any numbers to support this patch? The mapping types are performance
> optimisations so nice to know how much of a difference they make.
Output buffers are definitely faster in write-combine mode. On a ZedBoard with an AD9361 transceiver set to 66 MSPS, and buffer/size set to 8192, I would get about 185 MiB/s before, 197 MiB/s after.
Input buffers... early results are mixed. On ARM32 it does look like it is slightly faster to read from *uncached* memory than from cached memory; the cache sync does take a long time.
Other architectures might show different results. On MIPS, for instance, invalidating the cache is a very fast operation, so using cached buffers would be a huge performance win.
Setups where the DMA operations are coherent also wouldn't require any cache sync, so this patch would be a big win there as well.
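For reference, the sync I'm measuring on the input path is essentially the following (a sketch; the structure and function names are made up, not the actual IIO code). This is the operation that is expensive on non-coherent ARM32, cheap on MIPS, and unnecessary on coherent setups:

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>

/* Made-up private data for the sketch, not the real driver structure. */
struct sketch_dmabuf_priv {
	struct device *dev;
	struct sg_table *sgt;
};

/*
 * Input (capture) buffer path: sync/invalidate the CPU caches before
 * userspace reads the freshly DMA'd samples.
 */
static int sketch_begin_cpu_access(struct dma_buf *dbuf,
				   enum dma_data_direction dir)
{
	struct sketch_dmabuf_priv *priv = dbuf->priv;

	if (dir == DMA_FROM_DEVICE)
		dma_sync_sgtable_for_cpu(priv->dev, priv->sgt, dir);

	return 0;
}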
I'll run some more tests next week to have some fresh numbers.