Re: [PATCH net-next] page_pool: let the compiler optimize and inline core functions

From: Alexander Lobakin
Date: Tue Mar 23 2021 - 11:37:33 EST

Next message: Ahmad Fatoum: "[PATCH v3 1/2] driver core: clear deferred probe reason on probe retry"
Previous message: Andy Shevchenko: "[PATCH v2 5/7] usb: gadget: pch_udc: Revert d3cb25a12138 completely"
In reply to: Jesper Dangaard Brouer: "Re: [PATCH net-next] page_pool: let the compiler optimize and inline core functions"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

From: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
Date: Tue, 23 Mar 2021 10:01:38 +0100

> On Mon, 22 Mar 2021 18:30:55 +0000
> Alexander Lobakin <alobakin@xxxxx> wrote:
>
> > As per disscussion in Page Pool bulk allocator thread [0],
> > there are two functions in Page Pool core code that are marked as
> > 'noinline'. The reason for this is not so clear, and even if it
> > was made to reduce hotpath overhead, in fact it only makes things
> > worse.
> > As both of these functions as being called only once through the
> > code, they could be inlined/folded into the non-static entry point.
> > However, 'noinline' marks effectively prevent from doing that and
> > induce totally unneeded fragmentation (baseline -> after removal):
> >
> > add/remove: 0/3 grow/shrink: 1/0 up/down: 1024/-1096 (-72)
> > Function old new delta
> > page_pool_alloc_pages 100 1124 +1024
> > page_pool_dma_map 164 - -164
> > page_pool_refill_alloc_cache 332 - -332
> > __page_pool_alloc_pages_slow 600 - -600
> >
> > (taken from Mel's branch, hence factored-out page_pool_dma_map())
>
> I see that the refactor of page_pool_dma_map() caused it to be
> uninlined, that were a mistake. Thanks for high-lighting that again
> as I forgot about this (even-though I think Alex Duyck did point this
> out earlier).
>
> I am considering if we should allow compiler to inline
> page_pool_refill_alloc_cache + __page_pool_alloc_pages_slow, for the
> sake of performance and I loose the ability to diagnose the behavior
> from perf-report. Mind that page_pool avoids stat for the sake of
> performance, but these noinline makes it possible to diagnose the
> behavior anyway.
>
> >
> > 1124 is a normal hotpath frame size, but these jumps between tiny
> > page_pool_alloc_pages(), page_pool_refill_alloc_cache() and
> > __page_pool_alloc_pages_slow() are really redundant and harmful
> > for performance.
>
> Well, I disagree. (this is a NACK)
>
> If pages were recycled then the code never had to visit
> __page_pool_alloc_pages_slow(). And today without the bulk page-alloc
> (that we are working on adding together with Mel) we have to visit
> __page_pool_alloc_pages_slow() every time, which is a bad design, but
> I'm trying to fix that.
>
> Matteo is working on recycling here[1]:
> [1] https://lore.kernel.org/netdev/20210322170301.26017-1-mcroce@xxxxxxxxxxxxxxxxxxx/
>
> It would be really great if you could try out his patchset, as it will
> help your driver avoid the slow path of the page_pool. Given you are
> very detailed oriented, I do want to point out that Matteo's patchset
> is only the first step, as to really improve performance for page_pool,
> we need to bulk return these page_pool pages (it is requires some
> restructure of the core code, that will be confusing at this point).

I tested it out when I saw the first RFC. Its code seemed fine to me
and I was wondering what could it bring to my workloads.
The reason why I didn't post the results is because they're actually
poor on my system.

I retested it again, this time v1 instead of RFC and also tried
the combined with bulk allocation variant.

VLAN NAT, GRO + TSO/USO, Page size 16 Kb.
XDP_PASS -> napi_build_skb() -> napi_gro_receive().

I disable fraglist offload and nftables Flow offload to drop
the performance below link speed.

1.
- 5.12-rc3:

TCP 572 Mbps
UDP 616 Mbps

2.
- 5.12-rc3;
- Page Pool recycling by Matteo (with replacing
page_pool_release_page() with skb_mark_for_recycle()
in my driver):

TCP 540 Mbps
UDP 572 Mbps

First time when I saw the results, I didn't believe everything works
as expected from the code I saw, and pages are actually being recycled.
But then I traced skb and pages' paths and made sure that recycling
actually happens (on every frame).
The reason for such a heavy drop, at least that I can guess, is that
page_frag_free() that's being called on skb->head and its frags is
very lightweight and straightforward. When recycling is on, the
following chain is being called for skb head and every frag:

page_pool_return_skb_page()
xdp_return_skb_frame()
__xdp_return()
page_pool_put_full_page()

Also, as allow_direct is false (which is fine -- for context safety
reasons), recycled pages are being returned into the ptr_ring (with
taking and freeing the spinlock) instead of the direct cache. So next
Page Pool allocations will inavoidably hit (noinline)
page_pool_refill_alloc_cache(), take the spinlock again and so on.

3.
- 5.12-rc3;
- Page Pool recycling;
- bulk allocations:

TCP 545 Mbps
UDP 610 Mbps

As I wrote earlier, bulk allocator suffers from compiler which
uninlines __rmqueue_pcplist() and rmqueue_bulk(), at least on
my board.
So I don't take these results into account at all, instead:

4.
- 5.12-rc3;
- Page Pool recycling;
- bulk allocations with
- marking __rmqueue_pcplist() and rmqueue_bulk() as __always_inline:

TCP 590 Mbps
UDP 635 Mbps

I think here we finally hit the point where bulk allocations and
page recycling (perhaps partially) come in.
And just for reference:

5.
- 5.12-rc3;
- Page Pool recycling;
- bulk allocations, with
- marking __rmqueue_pcplist() and rmqueue_bulk() as __always_inline
and also
- dropping 'noinline' mark from page_pool_refill_alloc_cache() and
__page_pool_alloc_pages_slow():

TCP 595 Mbps
UDP 650 Mbps

- PP recycling always stores recycled pages in ptr_ring, so
page_pool_refill_alloc_cache() is still on the hotpath;
- bulk allocator places new pages into direct cache, but it
hides inside __page_pool_alloc_pages_slow().

>
> > This simple removal of 'noinline' keywords bumps the throughput
> > on XDP_PASS + napi_build_skb() + napi_gro_receive() on 25+ Mbps
> > for 1G embedded NIC.
> >
> > [0] https://lore.kernel.org/netdev/20210317222506.1266004-1-alobakin@xxxxx
> >
> > Signed-off-by: Alexander Lobakin <alobakin@xxxxx>
> > ---
> > net/core/page_pool.c | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index ad8b0707af04..589e4df6ef2b 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -102,7 +102,6 @@ EXPORT_SYMBOL(page_pool_create);
> >
> > static void page_pool_return_page(struct page_pool *pool, struct page *page);
> >
> > -noinline
> > static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> > {
> > struct ptr_ring *r = &pool->ring;
> > @@ -181,7 +180,6 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> > }
> >
> > /* slow path */
> > -noinline
> > static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> > gfp_t _gfp)
> > {
> > --
> > 2.31.0
> >
> >
>
>
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer

Thanks,
Al

Next message: Ahmad Fatoum: "[PATCH v3 1/2] driver core: clear deferred probe reason on probe retry"
Previous message: Andy Shevchenko: "[PATCH v2 5/7] usb: gadget: pch_udc: Revert d3cb25a12138 completely"
In reply to: Jesper Dangaard Brouer: "Re: [PATCH net-next] page_pool: let the compiler optimize and inline core functions"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]