Re: igb driver can cause cache invalidation of non-owned memory?
From: Nikita Yushchenko
Date: Wed Oct 12 2016 - 02:56:11 EST
>>> The main reason why this isn't a concern for the igb driver is because
>>> we currently pass the page up as read-only. We don't allow the stack
>>> to write into the page by keeping the page count greater than 1 which
>>> means that the page is shared. It isn't until we unmap the page that
>>> the page count is allowed to drop to 1 indicating that it is writable.
>> Doesn't that mean that sync_to_device() in igb_reuse_rx_page() can be
>> avoided? If page is read only for entire world, then it can't be dirty
>> in cache and thus device can safely write to it without preparation step.
> For the sake of correctness we were adding the
Could you please elaborate this "for sake of correctness"?
If by "correctness" you mean ensuring that buffer gets frame DMAed by
device and that's not broken by cache activity, then:
- on first use of this buffer after page allocation, sync_for_device()
is not needed due to previous dma_page_map() call,
- on later uses of the same buffer, sync_for_device() is not needed due
to buffer being read-only since dma_page_map() call, thus it can't be
dirty in cache and thus no writebacks of this area can be possible.
If by "correctness" you mean strict following "ownership" concept - i.e.
memory area is "owned" either by cpu or by device, and "ownersip" must
be passed to device before DMA and back to cpu after DMA - then, igb
driver already breaks these rules anyway:
- igb calls dma_map_page() at page allocation time, thus entire page
becomes "owned" by device,
- and then, on first use of second buffer inside the page, igb calls
sync_for_device() for buffer area, despite of that area is already
"owned" by device,
- and later, if a buffer within page gets reused, igb calls
sync_for_device() for entire buffer, despite of only part of buffer was
sync_for_cpu()'ed at time of completing receive of previous frame into
- and later, igb calls dma_unmap_page(), despite of that part of page
was sync_for_cpu()'ed and thus is "owned" by CPU.
Given all that, not calling sync_for_device() before reusing buffer
won't make picture much worse :)
> Since it is an DMA_FROM_DEVICE
> mapping calling it should really have no effect for most DMA mapping
Unfortunately dma_sync_single_range_for_device() *is* slow on imx6q - it
does cache invalidation. I don't really understand why invalidating
cache can be slow - it only removes data from cache, it should not
access slow outer memory - but cache invalidation *is* in top of perf
To get some throughput improvement, I propose removal of that
sync_for_device() before reusing buffer. Will you accept such a patch ;)
> Also you may want to try updating to the 4.8 version of the driver.
> It reduces the size of the dma_sync_single_range_for_cpu loops by
> reducing the sync size down to the size that was DMAed into the
Actually that patch came out of the work I'm currently participating in
;). Sure I have it.
> Specifically I believe
> the 0->100% accounting problem is due to the way this is all tracked.
Thanks for this hint - shame on me not realizing this earlier...
> You may want to try pulling the most recent net-next kernel and
> testing that to see if you still see the same behavior as Eric has
> recently added a fix that is meant to allow for better sharing between
> softirq polling and applications when dealing with stuff like UDP
> As far as identifying the problem areas your best bet would be to push
> the CPU to 100% and then identify the hot spots.
Thanks for hints