Re: [PATCH][v2] page_pool: handle page recycle for NUMA_NO_NODE condition

From: Saeed Mahameed
Date: Tue Dec 17 2019 - 14:27:27 EST


On Thu, 2019-12-12 at 11:18 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 12 Dec 2019 09:34:14 +0800
> Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>
> > +CC Michal, Peter, Greg and Bjorn
> > Because there has been disscusion about where and how the
> > NUMA_NO_NODE
> > should be handled before.
> >
> > On 2019/12/12 5:24, Saeed Mahameed wrote:
> > > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer
> > > wrote:
> > > > On Sat, 7 Dec 2019 03:52:41 +0000
> > > > Saeed Mahameed <saeedm@xxxxxxxxxxxx> wrote:
> > > >
> > > > > I don't think it is correct to check that the page nid is
> > > > > same as
> > > > > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should
> > > > > allow
> > > > > all pages to recycle, because you can't assume where pages
> > > > > are
> > > > > allocated from and where they are being handled.
> > > >
> > > > I agree, using numa_mem_id() is not valid, because it takes the
> > > > numa
> > > > node id from the executing CPU and the call to
> > > > __page_pool_put_page()
> > > > can happen on a remote CPU (e.g. cpumap redirect, and in future
> > > > SKBs).
> > > >
> > > >
> > > > > I suggest the following:
> > > > >
> > > > > return !page_pfmemalloc() &&
> > > > > ( page_to_nid(page) == pool->p.nid || pool->p.nid ==
> > > > > NUMA_NO_NODE );
> > > >
> > > > Above code doesn't generate optimal ASM code, I suggest:
> > > >
> > > > static bool pool_page_reusable(struct page_pool *pool, struct
> > > > page *page)
> > > > {
> > > > return !page_is_pfmemalloc(page) &&
> > > > pool->p.nid != NUMA_NO_NODE &&
> > > > page_to_nid(page) == pool->p.nid;
> > > > }
> > > >
> > >
> > > this is not equivalent to the above. Here in case pool->p.nid is
> > > NUMA_NO_NODE, pool_page_reusable() will always be false.
> > >
> > > We can avoid the extra check in data path.
> > > How about avoiding NUMA_NO_NODE in page_pool altogether, and
> > > force
> > > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at
> > > page
> > > pool init, as already done in alloc_pages_node().
> >
> > That means we will not support page reuse mitigation for
> > NUMA_NO_NODE,
> > which is not same semantic that alloc_pages_node() handle
> > NUMA_NO_NODE,
> > because alloc_pages_node() will allocate the page based on the node
> > of the current running cpu.
>
> True, as I wrote (below) my code defines semantics as: that a
> page_pool
> configured with NUMA_NO_NODE means skip NUMA checks, and allow
> recycle

Your code will NOT allow recycling when NUMA_NO_NODE is configured.
so i am not sure what semantics you are referring to :)

> regardless of NUMA node page belong to. It seems that you want
> another
> semantics.
>

I think that the semantics we want is to allow recycling when
NUMA_NO_NODE is selected, to solve the reported issue ? no ?

> I'm open to other semantics. My main concern is performance. The
> page_pool fast-path for driver recycling use-case of XDP_DROP, have
> extreme performance requirements, as it needs to compete with driver
> local recycle tricks (else we cannot use page_pool to simplify
> drivers).
> The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
> in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single q),
> and BjÃrn is showing 30 Mpps = 33.3ns with i40e. At this level every
> cycle/instruction counts.
>

I agree.

>
> > Also, There seems to be a wild guessing of the node id here, which
> > has
> > been disscussed before and has not reached a agreement yet.
> >
> > > which will imply recycling without adding any extra condition to
> > > the
> > > data path.
>
> I love code that moves thing out of our fast-path.
>
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > index a6aefe989043..00c99282a306 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool
> > > *pool,
> > >
> > > memcpy(&pool->p, params, sizeof(pool->p));
> > >
> > > + /* overwrite to allow recycling.. */
> > > + if (pool->p.nid == NUMA_NO_NODE)
> > > + pool->p.nid = numa_mem_id();
> > > +
>
> The problem is that page_pool_init() is can be initiated from a
> random
> CPU, first at driver setup/bringup, and later at other queue changes
> that can be started via ethtool or XDP attach. (numa_mem_id() picks
> from running CPU).
>

well if the user selected NUMA_NO_NODE, then it is his choice ..
Also the user always have the ability to change pool->p.nid, using the
API i introduced. so i don't see any issue with this.

> As Yunsheng mentioned elsewhere, there is also a dev_to_node()
> function.
> Isn't that what we want in a place like this?
>

We should keep this outside page_pool, if the user want dev_to_node, he
can set this as a parameter to the page pool from outside. when
NUMA_NO_NODE is selected this means that user doesn't really care.

>
> One issue with dev_to_node() is that in case of !CONFIG_NUMA it
> returns
> NUMA_NO_NODE (-1). (And device_initialize() also set it to
> -1). Thus,
> in that case we set pool->p.nid = 0, as page_to_nid() will also
> return
> zero in that case (as far as I follow the code).
>

yes this is the idea, since alloc_page will also select cpu 0 if you
keep NUMA_NO_NODE, then recycle will happen inherently by design
without any change to data path.

> > > After a quick look, i don't see any reason why to keep
> > > NUMA_NO_NODE in
> > > pool->p.nid..
> > >
> > >
> > > > I have compiled different variants and looked at the ASM code
> > > > generated by GCC. This seems to give the best result.
> > > >
> > > >
> > > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > > 2) always recycle if pool is NUMA_NO_NODE.
> > > >
> > > > Yes, this defines the semantics, that a page_pool configured
> > > > with
> > > > NUMA_NO_NODE means skip NUMA checks. I think that sounds
> > > > okay...
> > > >
> > > >
> > > > > the above change should not add any overhead, a modest branch
> > > > > predictor will handle this with no effort.
> > > >
> > > > It still annoys me that we keep adding instructions to this
> > > > code
> > > > hot-path (I counted 34 bytes and 11 instructions in my proposed
> > > > function).
> > > >
> > > > I think that it might be possible to move these NUMA checks to
> > > > alloc-side (instead of return/recycles side as today), and
> > > > perhaps
> > > > only on slow-path when dequeuing from ptr_ring (as recycles
> > > > that
> > > > call __page_pool_recycle_direct() will be pinned during NAPI).
> > > > But lets focus on a smaller fix for the immediate issue...
> > > >
> > >
> > > I know. It annoys me too, but we need recycling to work in
> > > production : where rings/napi can migrate and numa nodes can be
> > > NUMA_NO_NODE :-(.
> > >
> > >