Re: [PATCH net-next 0/6] page_pool: recycle buffers

From: Alexander Lobakin
Date: Tue Mar 23 2021 - 12:56:18 EST


From: Matteo Croce <mcroce@xxxxxxxxxxxxxxxxxxx>
Date: Tue, 23 Mar 2021 17:28:32 +0100

> On Tue, Mar 23, 2021 at 5:10 PM Ilias Apalodimas
> <ilias.apalodimas@xxxxxxxxxx> wrote:
> >
> > On Tue, Mar 23, 2021 at 05:04:47PM +0100, Jesper Dangaard Brouer wrote:
> > > On Tue, 23 Mar 2021 17:47:46 +0200
> > > Ilias Apalodimas <ilias.apalodimas@xxxxxxxxxx> wrote:
> > >
> > > > On Tue, Mar 23, 2021 at 03:41:23PM +0000, Alexander Lobakin wrote:
> > > > > From: Matteo Croce <mcroce@xxxxxxxxxxxxxxxxxxx>
> > > > > Date: Mon, 22 Mar 2021 18:02:55 +0100
> > > > >
> > > > > > From: Matteo Croce <mcroce@xxxxxxxxxxxxx>
> > > > > >
> > > > > > This series enables recycling of the buffers allocated with the page_pool API.
> > > > > > The first two patches are just prerequisite to save space in a struct and
> > > > > > avoid recycling pages allocated with other API.
> > > > > > Patch 2 was based on a previous idea from Jonathan Lemon.
> > > > > >
> > > > > > The third one is the real recycling, 4 fixes the compilation of __skb_frag_unref
> > > > > > users, and 5,6 enable the recycling on two drivers.
> > > > > >
> > > > > > In the last two patches I reported the improvement I have with the series.
> > > > > >
> > > > > > The recycling as-is can't be used with drivers like mlx5 which do page splitting,
> > > > > > but this is documented in a comment.
> > > > > > In the future, a refcount can be used so as to support mlx5 with no changes.
> > > > > >
> > > > > > Ilias Apalodimas (2):
> > > > > > page_pool: DMA handling and allow to recycles frames via SKB
> > > > > > net: change users of __skb_frag_unref() and add an extra argument
> > > > > >
> > > > > > Jesper Dangaard Brouer (1):
> > > > > > xdp: reduce size of struct xdp_mem_info
> > > > > >
> > > > > > Matteo Croce (3):
> > > > > > mm: add a signature in struct page
> > > > > > mvpp2: recycle buffers
> > > > > > mvneta: recycle buffers
> > > > > >
> > > > > > .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c | 2 +-
> > > > > > drivers/net/ethernet/marvell/mvneta.c | 4 +-
> > > > > > .../net/ethernet/marvell/mvpp2/mvpp2_main.c | 17 +++----
> > > > > > drivers/net/ethernet/marvell/sky2.c | 2 +-
> > > > > > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
> > > > > > include/linux/mm_types.h | 1 +
> > > > > > include/linux/skbuff.h | 33 +++++++++++--
> > > > > > include/net/page_pool.h | 15 ++++++
> > > > > > include/net/xdp.h | 5 +-
> > > > > > net/core/page_pool.c | 47 +++++++++++++++++++
> > > > > > net/core/skbuff.c | 20 +++++++-
> > > > > > net/core/xdp.c | 14 ++++--
> > > > > > net/tls/tls_device.c | 2 +-
> > > > > > 13 files changed, 138 insertions(+), 26 deletions(-)
> > > > >
> > > > > Just for the reference, I've performed some tests on 1G SoC NIC with
> > > > > this patchset on, here's direct link: [0]
> > > > >
> > > >
> > > > Thanks for the testing!
> > > > Any chance you can get a perf measurement on this?
> > >
> > > I guess you mean perf-report (--stdio) output, right?
> > >
> >
> > Yea,
> > As hinted below, I am just trying to figure out if, on Alexander's platform,
> > the cost of syncing is bigger than free-allocate. I remember one armv7 where
> > that was the case.
> >
> > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > >
> > > (+1 this is an important question)

Sure, I'll drop perf tools to my test env and share the results,
maybe tomorrow or in a few days.
From what I know for sure about MIPS and my platform,
post-Rx syncing (dma_sync_single_for_cpu()) is a no-op, and
pre-Rx (dma_sync_single_for_device() etc.) is a bit expensive.
I always set a sane page_pool->p.max_len value (roughly 1668
for an MTU of 1500) to minimize that overhead.
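For reference, here's a minimal sketch of how a driver can bound that
pre-Rx sync cost through page_pool_params::max_len when creating its Rx
page_pool. It's not taken from mvpp2/mvneta or from this series; the
foo_* name and the headroom/length math are illustrative assumptions:

#include <linux/bpf.h>
#include <linux/skbuff.h>
#include <net/page_pool.h>

/* Illustrative only: create an Rx page_pool whose pre-Rx sync length
 * (max_len) covers just the area the NIC may write, not the whole page. */
static struct page_pool *foo_create_rx_pool(struct device *dev,
					     unsigned int rxq_size)
{
	struct page_pool_params pp_params = {
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.order		= 0,
		.pool_size	= rxq_size,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
		.offset		= XDP_PACKET_HEADROOM,
		/* Sync only headroom + frame area; the skb_shared_info tail
		 * of the page is never written by the device. */
		.max_len	= PAGE_SIZE - XDP_PACKET_HEADROOM -
				  SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
	};

	return page_pool_create(&pp_params);
}

With PP_FLAG_DMA_SYNC_DEV set, page_pool itself syncs at most max_len bytes
before handing a recycled page back to the device, which is what keeps the
pre-Rx cost down here.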

By the way, IIRC, all machines shipped with mvpp2 have hardware
cache coherency units and don't suffer from sync routines at all.
That may be the reason why mvpp2 wins the most from this series.

> > > > >
> > > > > [0] https://lore.kernel.org/netdev/20210323153550.130385-1-alobakin@xxxxx
> > > > >
> > >
>
> That would be the same as for mvneta:
>
> Overhead  Shared Object  Symbol
>   24.10%  [kernel]       [k] __pi___inval_dcache_area
>   23.02%  [mvneta]       [k] mvneta_rx_swbm
>    7.19%  [kernel]       [k] kmem_cache_alloc
>
> Anyway, I tried to use the recycling *and* napi_build_skb on mvpp2,
> and I get lower packet rate than recycling alone.
> I don't know why, we should investigate it.

The mvpp2 driver doesn't use napi_consume_skb() on its Tx completion path.
As a result, the per-CPU NAPI caches get refilled only through
kmem_cache_alloc_bulk(), and most of the skbuff_head recycling
doesn't work.
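
To illustrate, a hedged sketch (generic, not mvpp2's actual completion
code; the foo_* structures and helpers are made up) of a Tx cleanup loop
that does feed those caches via napi_consume_skb():

#include <linux/skbuff.h>

struct foo_tx_buf {
	struct sk_buff *skb;
};

struct foo_tx_ring {
	struct foo_tx_buf *bufs;
	unsigned int clean_idx;
	unsigned int mask;
};

/* Hypothetical helper: true while there are completed descriptors. */
static bool foo_tx_desc_done(const struct foo_tx_ring *ring);

static void foo_tx_clean(struct foo_tx_ring *ring, int budget)
{
	while (foo_tx_desc_done(ring)) {
		struct foo_tx_buf *buf = &ring->bufs[ring->clean_idx];

		/* ...DMA unmap and stats update elided... */

		/* Called from NAPI context (budget != 0), this returns the
		 * skbuff_head to the per-cpu NAPI cache instead of freeing
		 * it, so napi_build_skb() on the Rx side can reuse it. */
		napi_consume_skb(buf->skb, budget);
		buf->skb = NULL;

		ring->clean_idx = (ring->clean_idx + 1) & ring->mask;
	}
}

Without that call on Tx completion, the Rx-side napi_build_skb() mostly
has to allocate fresh heads, which would be consistent with the lower
packet rate you're seeing.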

> Regards,
> --
> per aspera ad upstream

Oh, I love that one!

Al