Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache

From: Eric Dumazet
Date: Wed Feb 28 2024 - 04:39:03 EST


On Wed, Feb 28, 2024 at 8:06 AM Shijie Huang
<shijie@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>
> 在 2024/2/27 20:55, Eric Dumazet 写道:
> > On Tue, Feb 27, 2024 at 7:29 AM Huang Shijie
> > <shijie@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >> Since we do not set FLAG_SKB_NO_MERGE for skbuff_fclone_cache,
> >> the current skbuff_fclone_cache maybe not really allocated, it maybe
> >> used an exist old kmem_cache. In NUMA, the fclone allocated by
> >> alloc_skb_fclone() maybe in remote node.
> > Why is this happening in the first place ? Whab about skb->head ?
>
> I tested the fclone firstly. I did not test others yet.
>
> I did not check the skb->head yet.
>
> But I ever checked the pfrag's page, it is okay.
>
>
> >
> > Jesper patch [1] motivation was not about NUMA., but about
> > fragmentation and bulk allocations/freeing.
> >
> > TCP fclones are not bulk allocated/freed, so I do not understand what
> > your patch is doing.
> > You need to give more details, and experimental results.
>
> 1.) My NUMA machine:
>
> node 0 (CPU 0 ~ CPU79):
>
> CPU 0 ~ CPU 39 are used as memcached's server
>
> CPU 40 ~ CPU 79 are used as memcached's client
>
> node 1 (CPU 80 ~ CPU160):
>
> CPU 80 ~ CPU 119 are used as memcached's server
>
> CPU 120 ~ CPU 179 are used as memcached's client
>
> the kernel is linux-next 20240227
>
>
> 2.) My private patches:
>
> patch 1 is for slub:
>
> ---
> mm/slub.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 5d838ebfa35e..d2ab1e36fd6b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5691,6 +5691,7 @@ __kmem_cache_alias(const char *name, unsigned int
> size, unsigned int align,
>
> s = find_mergeable(size, align, flags, name, ctor);
> if (s) {
> + printk("[%s] origin:%s, shared :%s\n", __func__, name,
> s->name);
> if (sysfs_slab_alias(s, name))
> return NULL;
>
> ---------
>
> This patch is used the check which is the sharing kmem_cache for
> "skbuff_fclone_cache".
>
> I cannot find the "skbuff_fclone_cache" in /proc/slabinfo.
>
> From my test, the "pool_workqueue" is the real working kmem_cache.
>
> The "skbuff_fclone_cache" is just a pointer to "pool_workqueue"
> (pwq_cache).
>
>
> The following private patch is used to record the fclone allocation:
>
> ---
> net/ipv4/tcp.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index c82dc42f57c6..6f31ddcfc017 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -864,6 +864,24 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t
> *ppos,
> }
> EXPORT_SYMBOL(tcp_splice_read);
>
> +unsigned long right_num, wrong_num;
> +static void check_fclone(struct sk_buff *skb)
> +{
> + int n = numa_mem_id(); /* current node */
> + int node2 = page_to_nid(virt_to_page((unsigned long) skb));
> +
> + if (n != node2) {
> + wrong_num++;
> + if ((wrong_num % 1000) == 999)
> + printk(KERN_DEBUG "[%s] current:%d, get from
> :%d, (%ld, %ld, %ld)\n",
> + __func__, n, node2, wrong_num,
> right_num, wrong_num * 100 / (wrong_num + right_num));
> + } else {
> + right_num++;
> + if ((right_num % 1000000) == 9999)
> + printk("[%s] we received :%ld, %ld\n", __func__,
> right_num, wrong_num);
> + }
> +}
> +
> struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
> bool force_schedule)
> {
> @@ -884,6 +902,7 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock
> *sk, gfp_t gfp,
> skb_reserve(skb, MAX_TCP_HEADER);
> skb->ip_summed = CHECKSUM_PARTIAL;
> INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
> + check_fclone(skb);
> return skb;
> }
> __kfree_skb(skb);
> --
>
> Without this v2 patch, I can get the result after the memcached test:
>
> [ 1027.317645] [check_fclone] current:0, get from :1, (7112999, 9076711, 43)
> [ 1027.317653] [check_fclone] current:0, get from :1, (7112999, 9076707, 43)
> [ 1027.804110] [check_fclone] we received :10009999, 7113326
>
> It means nearly 43% fclone is allocated in the remote node.
>
>
> With this v2 patch, I can find the "skbuff_fclone_cache" in
> /proc/slabinfo.
>
> The test result shows below:
>
> [ 503.357293] [check_fclone] we received :8009999, 0
> [ 503.357293] [check_fclone] we received :8009999, 0
> [ 503.357305] [check_fclone] we received :8009999, 0
>
> After v2 patch, I cannot see the wrong fclone in remote node.
>
>
> >
> > Using SLAB_NO_MERGE does not help, I am still seeing wrong allocations
> > on a dual socket
> > host with plenty of available memory.
> > (either sk_buff or skb->head being allocated on the other node).
>
> Do you mean you still can see the wrong fclone after using SLAB_NO_MERGE?
>
> If so, I guess there is bug in the slub.
>
>
> > fclones might be allocated from a cpu running on node A, and freed
> > from a cpu running on node B.
> > Maybe SLUB is not properly handling this case ?
>
> Maybe.
>
>
>
> > SLAB_NO_MERGE will avoid merging fclone with kmalloc-512, it does not
> > really help.
> >
> > I think we need help from mm/slub experts, instead of trying to 'fix'
> > networking stacks.
>
> @Christopher
>
> Any idea about this?

I had a simpler bpftrace program to get an histogram of [my_node,
node(sk_buff), node_of(skb->head)]
and can tell that going back to linux-v6.7 and CONFIG_SLAB=y solved
all the issues for me.

99.9999% of SLAB allocations were on the right node.

This looks like a SLUB bug to me.