Hi Jesper,

Great that you are also testing AF_XDP, but do you have a particular [...]
On the mvneta driver/platform we saw a huge speedup replacing:
  page_pool_release_page(rxq->page_pool, page);
with
  skb_mark_for_recycle(skb);
After replacing page_pool_release_page with skb_mark_for_recycle, I found something that confused me a little in the testing results.
I tested with the sample app "xdpsock" in two modes: 1. Native (xdpsock -i eth0). 2. SKB-Mode (xdpsock -S -i eth0).
The following are the test results:
              With page_pool_release_page (pps)   With skb_mark_for_recycle (pps)
SKB-Mode      90K                                 200K
Native        190K                                190K
The skb_mark_for_recycle solution boosted the performance of SKB-Mode to 200K+ PPS. That is even higher than the performance of the Native solution. Is this result reasonable? Do you have any clue why the SKB-Mode performance can go higher than that of the Native one?

I might be able to explain this (Cc. AF_XDP maintainers to keep me honest).
-----Original Message-----
From: Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx>
Sent: Thursday, September 29, 2022 1:55 PM
To: Shenwei Wang <shenwei.wang@xxxxxxx>; Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx>; Andrew Lunn <andrew@xxxxxxx>
Cc: brouer@xxxxxxxxxx; Joakim Zhang <qiangqing.zhang@xxxxxxx>; David S. Miller <davem@xxxxxxxxxxxxx>; Eric Dumazet <edumazet@xxxxxxxxxx>; Jakub Kicinski <kuba@xxxxxxxxxx>; Paolo Abeni <pabeni@xxxxxxxxxx>; Alexei Starovoitov <ast@xxxxxxxxxx>; Daniel Borkmann <daniel@xxxxxxxxxxxxx>; Jesper Dangaard Brouer <hawk@xxxxxxxxxx>; John Fastabend <john.fastabend@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; imx@xxxxxxxxxxxxxxx
Subject: Re: [EXT] Re: [PATCH 1/1] net: fec: add initial XDP support
On 29/09/2022 17.52, Shenwei Wang wrote:
From: Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx>
On 29/09/2022 15.26, Shenwei Wang wrote:
[...]
From: Andrew Lunn <andrew@xxxxxxx>
Sent: Thursday, September 29, 2022 8:23 AM
I actually did some comparison testing of the page pool for normal traffic. So far I don't see a significant improvement in the current implementation. The performance for large packets improves a little, and the performance for small packets gets a little worse.
What hardware was this for? imx51? imx6? imx7 Vybrid? These all use the FEC.
I tested on the imx8qxp platform. It is ARM64.
On the mvneta driver/platform we saw a huge speedup replacing:
  page_pool_release_page(rxq->page_pool, page);
with
  skb_mark_for_recycle(skb);
As I mentioned: today page_pool has SKB recycle support (you might have looked at drivers that didn't utilize this yet), thus you don't need to release the page (page_pool_release_page) here. Instead you could simply mark the SKB for recycling, unless the driver does some page refcnt tricks I didn't notice.
On the mvneta driver/platform the DMA unmap (in page_pool_release_page) was very expensive. This imx8qxp platform might have a faster DMA unmap in case it is cache-coherent.
I would be very interested in knowing if skb_mark_for_recycle() helps
on this platform, for normal network stack performance.
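
Roughly, the change in a page_pool based RX path looks like the sketch below. This is untested and the function/variable names are only illustrative, not the exact fec or mvneta code:

  #include <linux/skbuff.h>
  #include <net/page_pool.h>

  /* Illustrative sketch: build an SKB around a page_pool page and let the
   * SKB free path recycle the page, instead of unmapping + releasing it.
   */
  static struct sk_buff *rx_build_skb(struct page_pool *pp, struct page *page,
                                      unsigned int len, unsigned int headroom)
  {
          struct sk_buff *skb = build_skb(page_address(page), PAGE_SIZE);

          if (!skb)
                  return NULL;

          skb_reserve(skb, headroom);
          skb_put(skb, len);

          /* Old: page_pool_release_page(pp, page);
           *      -> DMA unmap on every packet, page leaves the pool.
           * New: keep the page DMA-mapped and owned by page_pool; it is
           *      returned to the pool when the SKB is freed.
           */
          skb_mark_for_recycle(skb);

          return skb;
  }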
I did a quick comparison test of the following 3 scenarios:
Thanks for doing this! :-)
1. Original implementation
shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001
TCP window size:  416 KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[ 1] local 10.81.17.20 port 49154 connected with 10.81.16.245 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-1.0000 sec 104 MBytes 868 Mbits/sec
[ 1] 1.0000-2.0000 sec 105 MBytes 878 Mbits/sec
[ 1] 2.0000-3.0000 sec 105 MBytes 881 Mbits/sec
[ 1] 3.0000-4.0000 sec 105 MBytes 879 Mbits/sec
[ 1] 4.0000-5.0000 sec 105 MBytes 878 Mbits/sec
[ 1] 5.0000-6.0000 sec 105 MBytes 878 Mbits/sec
[ 1] 6.0000-7.0000 sec 104 MBytes 875 Mbits/sec
[ 1] 7.0000-8.0000 sec 104 MBytes 875 Mbits/sec
[ 1] 8.0000-9.0000 sec 104 MBytes 873 Mbits/sec
[ 1] 9.0000-10.0000 sec 104 MBytes 875 Mbits/sec
[ 1] 0.0000-10.0073 sec 1.02 GBytes 875 Mbits/sec
2. Page pool with page_pool_release_page
shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001
TCP window size:  416 KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[ 1] local 10.81.17.20 port 35924 connected with 10.81.16.245 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-1.0000 sec 101 MBytes 849 Mbits/sec
[ 1] 1.0000-2.0000 sec 102 MBytes 860 Mbits/sec
[ 1] 2.0000-3.0000 sec 102 MBytes 860 Mbits/sec
[ 1] 3.0000-4.0000 sec 102 MBytes 859 Mbits/sec
[ 1] 4.0000-5.0000 sec 103 MBytes 863 Mbits/sec
[ 1] 5.0000-6.0000 sec 103 MBytes 864 Mbits/sec
[ 1] 6.0000-7.0000 sec 103 MBytes 863 Mbits/sec
[ 1] 7.0000-8.0000 sec 103 MBytes 865 Mbits/sec
[ 1] 8.0000-9.0000 sec 103 MBytes 862 Mbits/sec
[ 1] 9.0000-10.0000 sec 102 MBytes 856 Mbits/sec
[ 1] 0.0000-10.0246 sec 1.00 GBytes 858 Mbits/sec
3. Page pool with skb_mark_for_recycle
shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001
TCP window size:  416 KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[ 1] local 10.81.17.20 port 42724 connected with 10.81.16.245 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-1.0000 sec 111 MBytes 931 Mbits/sec
[ 1] 1.0000-2.0000 sec 112 MBytes 935 Mbits/sec
[ 1] 2.0000-3.0000 sec 111 MBytes 934 Mbits/sec
[ 1] 3.0000-4.0000 sec 111 MBytes 934 Mbits/sec
[ 1] 4.0000-5.0000 sec 111 MBytes 934 Mbits/sec
[ 1] 5.0000-6.0000 sec 112 MBytes 935 Mbits/sec
[ 1] 6.0000-7.0000 sec 111 MBytes 934 Mbits/sec
[ 1] 7.0000-8.0000 sec 111 MBytes 933 Mbits/sec
[ 1] 8.0000-9.0000 sec 112 MBytes 935 Mbits/sec
[ 1] 9.0000-10.0000 sec 111 MBytes 933 Mbits/sec
[ 1] 0.0000-10.0069 sec 1.09 GBytes 934 Mbits/sec
This is a very significant performance improvement (page pool with skb_mark_for_recycle), and it is very close to the max goodput of a 1Gbit/s link.
For small packet size (64 bytes), all three cases have almost the same result:
To me this indicates that the DMA map/unmap operations on this platform are indeed more expensive for larger packets. This is exactly what page_pool avoids, by keeping the DMA mapping intact when recycling.
The driver still needs DMA-sync, but I notice you set the page_pool feature flag PP_FLAG_DMA_SYNC_DEV. This is good, as page_pool will try to reduce the sync size where possible, e.g. in this SKB case it will reduce the DMA-sync to max_len=FEC_ENET_RX_FRSIZE, which should also help performance.
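
For reference, the page_pool setup that enables this looks roughly like the sketch below. Only a sketch: the pool size, offset and the way FEC_ENET_RX_FRSIZE is used here are assumptions, not the exact fec code:

  #include <linux/dma-direction.h>
  #include <net/page_pool.h>

  /* Sketch: create a page_pool with PP_FLAG_DMA_SYNC_DEV so the core can
   * limit dma_sync_single_for_device() to max_len bytes when refilling,
   * instead of syncing the whole buffer. Values here are illustrative;
   * FEC_ENET_RX_FRSIZE comes from the fec driver.
   */
  static struct page_pool *rxq_page_pool_create(struct device *dev,
                                                unsigned int pool_size)
  {
          struct page_pool_params pp_params = {
                  .flags     = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
                  .order     = 0,
                  .pool_size = pool_size,
                  .nid       = NUMA_NO_NODE,
                  .dev       = dev,
                  .dma_dir   = DMA_FROM_DEVICE,
                  .offset    = XDP_PACKET_HEADROOM,
                  .max_len   = FEC_ENET_RX_FRSIZE, /* upper bound for DMA-sync */
          };

          return page_pool_create(&pp_params); /* ERR_PTR() on failure */
  }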
shenwei@5810:~$ iperf -c 10.81.16.245 -w 2m -i 1 -l 64
------------------------------------------------------------
Client connecting to 10.81.16.245, TCP port 5001 TCP window size: 416
KByte (WARNING: requested 1.91 MByte)
------------------------------------------------------------
[ 1] local 10.81.17.20 port 58204 connected with 10.81.16.245 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.0000-1.0000 sec 36.9 MBytes 309 Mbits/sec
[ 1] 1.0000-2.0000 sec 36.6 MBytes 307 Mbits/sec
[ 1] 2.0000-3.0000 sec 36.6 MBytes 307 Mbits/sec
[ 1] 3.0000-4.0000 sec 36.5 MBytes 307 Mbits/sec
[ 1] 4.0000-5.0000 sec 37.1 MBytes 311 Mbits/sec
[ 1] 5.0000-6.0000 sec 37.2 MBytes 312 Mbits/sec
[ 1] 6.0000-7.0000 sec 37.1 MBytes 311 Mbits/sec
[ 1] 7.0000-8.0000 sec 37.1 MBytes 311 Mbits/sec
[ 1] 8.0000-9.0000 sec 37.1 MBytes 312 Mbits/sec
[ 1] 9.0000-10.0000 sec 37.2 MBytes 312 Mbits/sec
[ 1] 0.0000-10.0097 sec 369 MBytes 310 Mbits/sec
Regards,
Shenwei
By small packets, do you mean those under the copybreak limit?
Please provide some benchmark numbers with your next patchset.
Yes, the packet size is 64 bytes and it is under the copybreak limit.
As the impact is not significant, I would prefer to remove the
copybreak logic.
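
For context, the copybreak path in question follows roughly this pattern: frames below a threshold are copied into a small, freshly allocated SKB so the original RX buffer can be reused immediately. This is an illustrative sketch, not the exact fec code:

  #include <linux/netdevice.h>
  #include <linux/skbuff.h>

  #define RX_COPYBREAK 256 /* illustrative threshold */

  /* Sketch: copy small frames into a new SKB and keep the original RX
   * buffer in place; larger frames take the normal (non-copy) path.
   */
  static struct sk_buff *rx_copybreak(struct net_device *ndev,
                                      const void *data, unsigned int len)
  {
          struct sk_buff *skb;

          if (len > RX_COPYBREAK)
                  return NULL; /* caller handles the normal path */

          skb = netdev_alloc_skb_ip_align(ndev, len);
          if (!skb)
                  return NULL;

          skb_put_data(skb, data, len);
          return skb;
  }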
+1 to removing this logic if possible, due to maintenance cost.
--Jesper