On Wed, May 22, 2019 at 01:51:01PM +0200, Rafał Miłecki wrote:
On 21.05.2019 12:45, Russell King - ARM Linux admin wrote:
> On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
I work on home routers based on Broadcom's Northstar SoCs. Those devices
have ARM Cortex-A9 and most of them are dual-core.
As for home routers, my main concern is network performance. That CPU
isn't powerful enough to handle gigabit traffic, so all kinds of
optimizations matter. I noticed some unexpected changes in NAT
performance when switching between kernels.
My hardware is BCM47094 SoC (dual core ARM) with integrated network
controller and external BCM53012 switch.
Guessing, I'd say it's to do with the placement of code wrt cachelines.
You could try aligning some of the cache flushing code to a cache line
and see what effect that has.
Is System.map a good place to check the alignment of function code?
With Linux 4.19 + OpenWrt mtd patches I have:
(...)
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_release
c02ca9b4 T blk_mq_free_queue
c02caa88 T blk_mq_update_nr_requests
c02cab50 T blk_mq_unique_tag
(...)
After cherry-picking 9316a9ed6895 ("blk-mq: provide helper for setting
up an SQ queue and tag set"):
(...)
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
(...)
c02ca3d0 T blk_mq_update_nr_hw_queues
c02ca69c T blk_mq_alloc_tag_set
c02ca94c T blk_mq_init_sq_queue <-- NEW
c02ca9c0 T blk_mq_release <-- Different address of this & all below
c02caa28 T blk_mq_free_queue
c02caafc T blk_mq_update_nr_requests
c02cabc4 T blk_mq_unique_tag
(...)
As you can see blk_mq_init_sq_queue has appeared in the System.map and
it affected the addresses of ~30000 symbols. I can believe some frequently
used symbols happened to end up well aligned and that improved overall
performance.
Interestingly v7_dma_inv_range() and v7_dma_clean_range() were not
relocated.
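The size of the shift can be read directly off the two System.map excerpts above. Below is a minimal Python sketch that uses only the addresses quoted in this mail; the 32-byte cache line is an assumption (it is the Cortex-A9 L1 line size, and also what .align 5 requests on ARM), and the helper names are made up for illustration.

```python
CACHE_LINE = 32  # assumption: Cortex-A9 L1 cache line size in bytes

# Addresses copied from the System.map excerpts in this mail.
before = {  # v4.19 + OpenWrt mtd patches
    "blk_mq_release": 0xC02CA94C,
    "blk_mq_free_queue": 0xC02CA9B4,
    "blk_mq_update_nr_requests": 0xC02CAA88,
    "blk_mq_unique_tag": 0xC02CAB50,
}
after = {  # same tree with 9316a9ed6895 cherry-picked
    "blk_mq_release": 0xC02CA9C0,
    "blk_mq_free_queue": 0xC02CAA28,
    "blk_mq_update_nr_requests": 0xC02CAAFC,
    "blk_mq_unique_tag": 0xC02CABC4,
}


def shifts(old, new):
    """Return how many bytes each symbol moved between the two maps."""
    return {name: new[name] - old[name] for name in old}


def line_offsets(symbols, line=CACHE_LINE):
    """Return each symbol's byte offset within its cache line."""
    return {name: addr % line for name, addr in symbols.items()}


moved = shifts(before, after)
off_before = line_offsets(before)
off_after = line_offsets(after)
for name in before:
    print(f"{name}: moved {moved[name]:#x} bytes, "
          f"line offset {off_before[name]} -> {off_after[name]}")
```

For these four symbols the shift is a constant 0x74 (116) bytes. Since 116 is not a multiple of 32, each of them ends up at a different offset within its cache line than before.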
*****
I followed Russell's suggestion and added .align 5 to cache-v7.S (see
two attached diffs).
1) v4.19 + OpenWrt mtd patches
egrep -B 1 -A 1 "v7_dma_(inv|clean)_range" System.map
c010ea58 T v7_flush_kern_dcache_area
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
c010eb18 T b15_dma_flush_range
2) v4.19 + OpenWrt mtd patches + two .align 5 in cache-v7.S
c010ea6c T v7_flush_kern_dcache_area
c010eac0 t v7_dma_inv_range
c010eb20 t v7_dma_clean_range
c010eb58 T b15_dma_flush_range
(actually 15 symbols above v7_dma_inv_range were relocated)
This method seems to work to some degree (at least it affects the
addresses in System.map).
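Whether a symbol actually sits on a cache-line boundary can be checked mechanically from System.map. Here is a rough Python sketch, assuming the usual "address type name" line layout and a 32-byte boundary (2**5, which is what .align 5 means on ARM); the function names are hypothetical and the sample data is just the two excerpts above.

```python
CACHE_LINE = 1 << 5  # .align 5 on ARM requests 2**5 = 32-byte alignment


def parse_system_map(text):
    """Map symbol name -> address for lines shaped like 'address type name'."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank lines and markers such as "(...)"
        try:
            symbols[parts[2]] = int(parts[0], 16)
        except ValueError:
            continue  # first field was not a hex address
    return symbols


def is_aligned(addr, boundary=CACHE_LINE):
    """True if addr is a multiple of boundary."""
    return addr % boundary == 0


WITHOUT_ALIGN = """\
c010ea58 T v7_flush_kern_dcache_area
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
c010eb18 T b15_dma_flush_range
"""

WITH_ALIGN = """\
c010ea6c T v7_flush_kern_dcache_area
c010eac0 t v7_dma_inv_range
c010eb20 t v7_dma_clean_range
c010eb58 T b15_dma_flush_range
"""

for label, text in (("without .align", WITHOUT_ALIGN), ("with .align", WITH_ALIGN)):
    syms = parse_system_map(text)
    for name in ("v7_dma_inv_range", "v7_dma_clean_range"):
        print(f"{label}: {name} @ {syms[name]:#x} aligned={is_aligned(syms[name])}")
```

By this check both functions start on a 32-byte boundary in the patched map. In the unpatched excerpt only v7_dma_inv_range is off a boundary (by 20 bytes); v7_dma_clean_range at c010eae0 already happens to be a multiple of 32.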
*****
I ran 2 tests for each combination of changes. Each test consisted of
10 sequences of: 30 seconds iperf session + reboot.
git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
Test #1: 738 Mb/s
Test #2: 737 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
patch -p1 < v7_dma_clean_range-align.diff
Test #1: 746 Mb/s
Test #2: 747 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
patch -p1 < v7_dma_inv_range-align.diff
Test #1: 745 Mb/s
Test #2: 746 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
patch -p1 < v7_dma_clean_range-align.diff
patch -p1 < v7_dma_inv_range-align.diff
Test #1: 762 Mb/s
Test #2: 761 Mb/s

As you can see, I got quite a nice performance improvement after aligning
both v7_dma_clean_range() and v7_dma_inv_range().
This is an improvement of about 3.3%.
It still wasn't as good as with 9316a9ed6895 cherry-picked but pretty
close.
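The 3.3% figure can be checked from the numbers above. This is a small Python sketch that assumes the two test results per configuration are simply averaged (the mail does not say how they were combined); the variable and function names are illustrative only.

```python
from statistics import mean

# Throughput in Mb/s as (Test #1, Test #2), copied from the results above.
baseline = (738, 737)      # v4.19 + OpenWrt mtd patches
both_aligned = (762, 761)  # same tree plus both .align 5 diffs


def gain_percent(old, new):
    """Relative change of the mean throughput, in percent."""
    return (mean(new) - mean(old)) / mean(old) * 100


gain = gain_percent(baseline, both_aligned)
print(f"improvement from aligning both functions: {gain:.2f}%")
```

Under that averaging assumption the gain is 24 Mb/s on a 737.5 Mb/s baseline, which rounds to the 3.3% quoted above.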

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
git cherry-pick -x 9316a9ed6895
Test #1: 770 Mb/s
Test #2: 766 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
git cherry-pick -x 9316a9ed6895
patch -p1 < v7_dma_clean_range-align.diff
Test #1: 756 Mb/s
Test #2: 759 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
git cherry-pick -x 9316a9ed6895
patch -p1 < v7_dma_inv_range-align.diff
Test #1: 758 Mb/s
Test #2: 759 Mb/s

git reset --hard v4.19
git am OpenWrt-mtd-chages.patch
git cherry-pick -x 9316a9ed6895
patch -p1 < v7_dma_clean_range-align.diff
patch -p1 < v7_dma_inv_range-align.diff
Test #1: 767 Mb/s
Test #2: 763 Mb/s

Now you can see how unpredictable it is. If I cherry-pick 9316a9ed6895
and do an extra alignment of v7_dma_clean_range() and v7_dma_inv_range()
that extra alignment can actually *hurt* NAT performance.
You have a maximum variance of 4Mb/s in your tests which is around
0.5%, and this shows a reduction of 3Mb/s, or 0.4%.
If we look at it a different way:
- Without the alignment patches, there is a difference of 4% in
performance depending on whether 9316a9ed6895 is applied.
- With the alignment patches, there is a difference of 0.4% in
performance depending on whether 9316a9ed6895 is applied.
How can this not be beneficial?
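To make that comparison concrete, here is a short Python sketch over the four relevant configurations from the results quoted earlier. It assumes the two runs per configuration are averaged, so the percentages come out close to, but not exactly, the rounded figures given above; the dictionary layout and function name are only for illustration.

```python
from statistics import mean

# Key: (9316a9ed6895 cherry-picked?, both alignment diffs applied?)
# Value: (Test #1, Test #2) in Mb/s, copied from the quoted results.
results = {
    (False, False): (738, 737),
    (True, False): (770, 766),
    (False, True): (762, 761),
    (True, True): (767, 763),
}


def cherry_pick_effect(aligned):
    """Percent change in mean throughput caused by the cherry-pick alone."""
    without = mean(results[(False, aligned)])
    with_pick = mean(results[(True, aligned)])
    return abs(with_pick - without) / without * 100


# Largest difference between Test #1 and Test #2 within one configuration.
max_spread = max(abs(t1 - t2) for t1, t2 in results.values())

print(f"max run-to-run spread: {max_spread} Mb/s")
print(f"cherry-pick effect without alignment: {cherry_pick_effect(False):.1f}%")
print(f"cherry-pick effect with alignment:    {cherry_pick_effect(True):.2f}%")
```

With these inputs the cherry-pick moves the mean by roughly 4.1% without the alignment diffs and by roughly 0.46% with them, the latter being about the same size as the 4 Mb/s run-to-run spread.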