Re: [PATCH net v1 2/2] lan743x: boost performance: limit PCIe bandwidth requirement
From: Andrew Lunn
Date: Wed Dec 09 2020 - 09:11:22 EST
On Tue, Dec 08, 2020 at 10:49:16PM -0500, Sven Van Asbroeck wrote:
> On Tue, Dec 8, 2020 at 6:36 PM Florian Fainelli <f.fainelli@xxxxxxxxx> wrote:
> >
> > dma_sync_single_for_{cpu,device} is what you would need in order to make
> > a partial cache line invalidation. You would still need to unmap the
> > same address+length pair that was used for the initial mapping otherwise
> > the DMA-API debugging will rightfully complain.
>
> I tried replacing
> dma_unmap_single(9K, DMA_FROM_DEVICE);
> with
> dma_sync_single_for_cpu(received_size=1500 bytes, DMA_FROM_DEVICE);
> dma_unmap_single_attrs(9K, DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
>
> and that works! But the bandwidth is still pretty bad, because the cpu
> now spends most of its time doing
> dma_map_single(9K, DMA_FROM_DEVICE);
> which spends a lot of time doing __dma_page_cpu_to_dev.
9K is not a nice number, since for each allocation it probably has to
find 4 contiguous pages. See what the performance difference is with
2K, 4K and 8K. If there is a big difference, you might want to special
case when the MTU is set for jumbo packets, or check if the hardware
can do scatter/gather.
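To put rough numbers on that, here is a small userspace sketch of the
arithmetic. It assumes 4K pages and that each buffer is rounded up to a
power-of-two number of contiguous pages; the helper names are made up
for illustration and are not taken from the driver or the kernel. It
also ignores any headroom or per-buffer overhead the driver may add on
top of the raw size.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages, as on most ARM and x86 configurations. */
#define ASSUMED_PAGE_SIZE 4096u

/*
 * Hypothetical helper: smallest order such that
 * (ASSUMED_PAGE_SIZE << order) >= size.
 */
static unsigned int order_for_size(size_t size)
{
	unsigned int order = 0;
	size_t span = ASSUMED_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Hypothetical helper: contiguous pages needed for one buffer. */
static unsigned int pages_for_size(size_t size)
{
	return 1u << order_for_size(size);
}
```

With these assumptions, 2K and 4K buffers fit in a single page, 8K
needs 2 contiguous pages, and 9K (9216 bytes) jumps to 4.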
You also need to be careful with caches and speculation. As you have
seen, bad things can happen. And it can be a lot more subtle. If some
code is accessing the page before the buffer and gets towards the end
of the page, the CPU might speculatively bring in the next page, i.e.
the start of the buffer. If that happens before the DMA operation, and
you don't invalidate the cache correctly, you get hard-to-find
corruption.
Andrew