Re: igb driver can cause cache invalidation of non-owned memory?

From: Nikita Yushchenko
Date: Wed Oct 12 2016 - 02:56:11 EST

>>> The main reason why this isn't a concern for the igb driver is because
>>> we currently pass the page up as read-only. We don't allow the stack
>>> to write into the page by keeping the page count greater than 1 which
>>> means that the page is shared. It isn't until we unmap the page that
>>> the page count is allowed to drop to 1 indicating that it is writable.
>> Doesn't that mean that sync_to_device() in igb_reuse_rx_page() can be
>> avoided? If page is read only for entire world, then it can't be dirty
>> in cache and thus device can safely write to it without preparation step.
> For the sake of correctness we were adding the
> dma_sync_single_range_for_device.

Could you please elaborate this "for sake of correctness"?

If by "correctness" you mean ensuring that buffer gets frame DMAed by
device and that's not broken by cache activity, then:
- on first use of this buffer after page allocation, sync_for_device()
is not needed due to previous dma_page_map() call,
- on later uses of the same buffer, sync_for_device() is not needed due
to buffer being read-only since dma_page_map() call, thus it can't be
dirty in cache and thus no writebacks of this area can be possible.

If by "correctness" you mean strict following "ownership" concept - i.e.
memory area is "owned" either by cpu or by device, and "ownersip" must
be passed to device before DMA and back to cpu after DMA - then, igb
driver already breaks these rules anyway:
- igb calls dma_map_page() at page allocation time, thus entire page
becomes "owned" by device,
- and then, on first use of second buffer inside the page, igb calls
sync_for_device() for buffer area, despite of that area is already
"owned" by device,
- and later, if a buffer within page gets reused, igb calls
sync_for_device() for entire buffer, despite of only part of buffer was
sync_for_cpu()'ed at time of completing receive of previous frame into
this buffer,
- and later, igb calls dma_unmap_page(), despite of that part of page
was sync_for_cpu()'ed and thus is "owned" by CPU.

Given all that, not calling sync_for_device() before reusing buffer
won't make picture much worse :)

> Since it is an DMA_FROM_DEVICE
> mapping calling it should really have no effect for most DMA mapping
> interfaces.

Unfortunately dma_sync_single_range_for_device() *is* slow on imx6q - it
does cache invalidation. I don't really understand why invalidating
cache can be slow - it only removes data from cache, it should not
access slow outer memory - but cache invalidation *is* in top of perf

To get some throughput improvement, I propose removal of that
sync_for_device() before reusing buffer. Will you accept such a patch ;)

> Also you may want to try updating to the 4.8 version of the driver.
> It reduces the size of the dma_sync_single_range_for_cpu loops by
> reducing the sync size down to the size that was DMAed into the
> buffer.

Actually that patch came out of the work I'm currently participating in
;). Sure I have it.

> Specifically I believe
> the 0->100% accounting problem is due to the way this is all tracked.

Thanks for this hint - shame on me not realizing this earlier...

> You may want to try pulling the most recent net-next kernel and
> testing that to see if you still see the same behavior as Eric has
> recently added a fix that is meant to allow for better sharing between
> softirq polling and applications when dealing with stuff like UDP
> traffic.
> As far as identifying the problem areas your best bet would be to push
> the CPU to 100% and then identify the hot spots.

Thanks for hints