On Sun, Nov 11, 2018 at 4:39 PM Paweł Staszewski <pstaszewski@xxxxxxxxx> wrote:
On 12.11.2018 at 00:05, Alexander Duyck wrote:
On Sat, Nov 10, 2018 at 3:54 PM Paweł Staszewski <pstaszewski@xxxxxxxxx> wrote:
On 05.11.2018 at 16:44, Alexander Duyck wrote:
On Mon, Nov 5, 2018 at 12:58 AM Aaron Lu <aaron.lu@xxxxxxxxx> wrote:

page_frag_free() calls __free_pages_ok() to free the page back to
Buddy. This is OK for high order pages, but for order-0 pages, it
misses the optimization opportunity of using Per-Cpu-Pages and can
cause zone lock contention when called frequently.

Paweł Staszewski recently shared his result of 'how Linux kernel
handles normal traffic'[1] and from perf data, Jesper Dangaard Brouer
found the lock contention comes from page allocator:

mlx5e_poll_tx_cq
|
 --16.34%--napi_consume_skb
           |
           |--12.65%--__free_pages_ok
           |          |
           |           --11.86%--free_one_page
           |                     |
           |                     |--10.10%--queued_spin_lock_slowpath
           |                     |
           |                      --0.65%--_raw_spin_lock
           |
           |--1.55%--page_frag_free
           |
            --1.44%--skb_release_data

Jesper explained how it happened: mlx5 driver RX-page recycle
mechanism is not effective in this workload and pages have to go
through the page allocator. The lock contention happens during
mlx5 DMA TX completion cycle. And the page allocator cannot keep
up at these speeds.[2]

I thought that __free_pages_ok() was mostly freeing high order
pages and that this was lock contention for high order pages,
but Jesper explained in detail that __free_pages_ok() here is
actually freeing order-0 pages because mlx5 is using order-0 pages
to satisfy its page pool allocation request.[3]

The free path as pointed out by Jesper is:
skb_free_head()
  -> skb_free_frag()
    -> page_frag_free()
And the pages being freed on this path are order-0 pages.

Fix this by doing similar things as in __page_frag_cache_drain() -
send the page being freed to PCP if it's an order-0 page, or
directly to Buddy if it is a high order page.

With this change, Paweł hasn't noticed lock contention yet in
his workload and Jesper has noticed a 7% performance improvement
using a micro benchmark and lock contention is gone.

[1]: https://www.spinics.net/lists/netdev/msg531362.html
[2]: https://www.spinics.net/lists/netdev/msg531421.html
[3]: https://www.spinics.net/lists/netdev/msg531556.html
Reported-by: Paweł Staszewski <pstaszewski@xxxxxxxxx>
Analysed-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
Signed-off-by: Aaron Lu <aaron.lu@xxxxxxxxx>
---
 mm/page_alloc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae31839874b8..91a9a6af41a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4555,8 +4555,14 @@ void page_frag_free(void *addr)
 {
 	struct page *page = virt_to_head_page(addr);

-	if (unlikely(put_page_testzero(page)))
-		__free_pages_ok(page, compound_order(page));
+	if (unlikely(put_page_testzero(page))) {
+		unsigned int order = compound_order(page);
+
+		if (order == 0)
+			free_unref_page(page);
+		else
+			__free_pages_ok(page, order);
+	}
 }
 EXPORT_SYMBOL(page_frag_free);

One thing I would suggest for Pawel to try would be to reduce the Tx
qdisc size on his transmitting interfaces, reduce the Tx ring size,
and possibly increase the Tx interrupt rate. Ideally we shouldn't have
too many packets in-flight and I suspect that is the issue that Pawel
is seeing that is leading to the page pool allocator freeing up the
memory. I know we like to try to batch things but the issue is
processing too many Tx buffers in one batch leads to us eating up too
much memory and causing evictions from the cache. Ideally the Rx and
Tx rings and queues should be sized as small as possible while still
allowing us to process up to our NAPI budget. Usually I run things
with a 128 Rx / 128 Tx setup and then reduce the Tx queue length so we
don't have more buffers stored there than we can place in the Tx ring.
Then we can avoid the extra thrash of having to pull/push memory into
and out of the freelists. Essentially the issue here ends up being
another form of buffer bloat.

Thanks Alexander - yes it can be - but in my scenario setting the RX
buffer below 4096 produces more interface rx drops, and no_rx_buffer
on the network controller that is receiving more packets.
So I need to stick with 3000-4000 on RX - and yes, I was trying to
lower the TX buffer on the ConnectX-4 - but that changed nothing
before Aaron's patch.
After Aaron's patch, decreasing the TX buffer influences the total
bandwidth that can be handled by the router/server.
I don't know why it made no difference before this patch: no matter
what I set there, page_alloc/slowpath was always on top in perf.
Currently testing RX 4096 / TX 256 - this helps with bandwidth, about
+10% more bandwidth with fewer interrupts...

The problem is if you are going for fewer interrupts you are setting
yourself up for buffer bloat. Basically you are going to use much more
cache and much more memory than you actually need, and if things are
properly configured NAPI should take care of the interrupts anyway
since under maximum load you shouldn't stop polling normally.

I'm trying to balance here - there is a problem because the server is
forwarding all kinds of protocols, packets of different sizes, etc.
The problem is I'm trying to go for a high interrupt rate - but
setting coalescing to adaptive for rx is killing the CPUs at 22Gbit/s
RX and 22Gbit with a really high interrupt rate.

I wouldn't recommend adaptive just because the behavior would be hard
to predict.
So by adding a little more latency I can turn off adaptive rx and set
rx-usecs in the range 16-64 - and this gives me more or fewer
interrupts - but the problem is the maximum bandwidth always stays
the same.

What about the tx-usecs, is that a functional thing for the adapter
you are using?
The Rx side logic should be pretty easy to figure out. Essentially you
want to keep the Rx ring size as small as possible while at the same
time avoiding storming the system with interrupts. I know for 10Gb/s I
have used a value of 25us in the past. What you want to watch for is
if you are dropping packets on the Rx side or not. Ideally you want
enough buffers that you can capture any burst while you wait for the
interrupt routine to catch up.
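
As a rough worked example of the Rx sizing described above: the number
of buffers needed to absorb a burst between two interrupts is roughly
the line rate multiplied by the interrupt delay. The sketch below
assumes worst-case minimum-size Ethernet frames (64 bytes plus 20 bytes
of preamble and inter-frame gap on the wire). The helper name is made
up for illustration and is not a kernel or driver API.

```c
#include <stdint.h>

/* On-wire size of a minimum Ethernet frame: 64 bytes of frame plus
 * 8 bytes of preamble/SFD and 12 bytes of inter-frame gap. */
#define MIN_FRAME_WIRE_BYTES 84ULL

/*
 * Hypothetical helper (assumption, not a kernel API): worst-case number
 * of minimum-size frames that can arrive during one interrupt delay.
 * link_bps is the line rate in bits per second, usecs is the rx-usecs
 * coalescing delay in microseconds.
 */
static uint64_t frames_per_interval(uint64_t link_bps, uint64_t usecs)
{
	uint64_t bits_per_frame = MIN_FRAME_WIRE_BYTES * 8;
	uint64_t bits_in_interval = link_bps / 1000000ULL * usecs;

	return bits_in_interval / bits_per_frame;
}
```

Under these assumptions, frames_per_interval(10000000000ULL, 25)
returns 372, so a 10Gb/s port with a 25us delay would need a few
hundred Rx descriptors to ride out one interval of minimum-size
frames. Real traffic with larger frames needs fewer.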
One issue I have seen is people delay interrupts for as long as
possible which isn't really a good thing since most network
controllers will use NAPI which will disable the interrupts and leave
them disabled whenever the system is under heavy stress so you should
be able to get the maximum performance by configuring an adapter with
small ring sizes and for high interrupt rates.

Sure, it is bad to set rx-usecs to high values - because at some point
this will add high latency for packets traversing both sides - and
start to hurt buffers.
But my problem is a little different now: I have no problems with the
RX side, because I can set anything like:
coalescence from 16 to 64
rx ring from 3000 to max 8192
And it does not change my max bw - only produces fewer or more
interrupts.

Right so the issue itself isn't Rx, you aren't throttled there. We are
probably looking at an issue of PCIe bandwidth or Tx slowing things
down. The fact that you are still firing interrupts is a bit
surprising though. Are the Tx and Rx interrupts linked for the device
you are using or are they firing them separately? Normally Rx traffic
won't generate many interrupts under a stress test as NAPI will leave
the interrupts disabled unless it can keep up. Anyway, my suggestion
would be to look at tuning things for as small a ring size as
possible.
So I started to change params for the TX side - and for now I know
that the best for me is
coalescence adaptive on
TX buffer 128
This helps with max BW, which for now is close to 70Gbit/s RX and
70Gbit TX, but after this change I have increasing DROPS on the TX
side for vlan interfaces.

So this sounds like you are likely bottlenecked due to either PCIe
bandwidth or latency. When you start putting back-pressure on the Tx
like you have described it starts pushing packets onto the Qdisc
layer. One thing that happens when packets are on the qdisc layer is
that they can start to perform a bulk dequeue. The side effect of this
is that you write multiple packets to the descriptor ring and then
update the hardware doorbell only once for the entire group of packets
instead of once per packet.
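
To put a number on the doorbell saving from bulk dequeue, here is a
minimal model. It is only an illustration of the counting argument
above, not driver code, and the function name and the fixed batch size
are assumptions.

```c
#include <stdint.h>

/*
 * Hypothetical model (assumption, not driver code): number of doorbell
 * writes needed to hand `packets` descriptors to the NIC when up to
 * `batch` packets are written to the ring before each doorbell update.
 * batch == 1 models one doorbell per packet.
 */
static uint64_t doorbell_writes(uint64_t packets, uint64_t batch)
{
	if (batch == 0)
		batch = 1;	/* treat 0 as "no batching" */

	return (packets + batch - 1) / batch;	/* ceiling division */
}
```

In this model 64 packets cost 64 doorbell writes without batching and
8 writes when dequeued in groups of 8. Each doorbell is a write over
PCIe, which is why batching matters if PCIe is the bottleneck.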
And only 50% cpu (max was 50% for 70Gbit/s)

It is easiest to think of it this way. Your total packet rate is equal
to your interrupt rate times the number of buffers you will store in
the ring. So if you have some fixed rate "X" for packets and an
interrupt rate of "i" then your optimal ring size should be "X/i". So
if you lower the interrupt rate you end up hurting the throughput
unless you increase the buffer size. However at a certain point the
buffer size starts becoming an issue. For example with UDP flows I
often see massive packet drops if you tune the interrupt rate too low
and then put the system under heavy stress.

Yes - in real-life traffic most DDoSes are like this: many pps with
small frames.

It sounds to me like XDP would probably be your best bet. With that
you could probably get away with smaller ring sizes, higher interrupt
rates, and get the advantage of it batching the Tx without having to
drop packets.
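
For reference, the "X/i" rule of thumb quoted in this message can be
written down directly. This is a minimal sketch; the helper name is
hypothetical, and rounding up to whole ring entries is my assumption
rather than something stated in the thread.

```c
#include <stdint.h>

/*
 * Hypothetical helper (assumption, not a kernel API): ring entries
 * needed so that a packet rate of pps ("X") is covered at irq_per_sec
 * interrupts per second ("i"), i.e. X / i rounded up.
 */
static uint64_t ring_entries_needed(uint64_t pps, uint64_t irq_per_sec)
{
	if (irq_per_sec == 0)
		return UINT64_MAX;	/* no interrupts: no finite ring suffices */

	return (pps + irq_per_sec - 1) / irq_per_sec;
}
```

For example, 14,880,000 pps at 40,000 interrupts per second (one every
25us) gives 372 entries, and halving the interrupt rate to 20,000
doubles that to 744. That is the trade-off described above: fewer
interrupts only keep up if the ring grows with them.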