RE: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
From: wangyunjian
Date: Mon Jan 29 2024 - 06:10:57 EST
> -----Original Message-----
> From: Jason Wang [mailto:jasowang@xxxxxxxxxx]
> Sent: Monday, January 29, 2024 11:05 AM
> To: wangyunjian <wangyunjian@xxxxxxxxxx>
> Cc: mst@xxxxxxxxxx; willemdebruijn.kernel@xxxxxxxxx; kuba@xxxxxxxxxx;
> davem@xxxxxxxxxxxxx; magnus.karlsson@xxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx;
> virtualization@xxxxxxxxxxxxxxx; xudingke <xudingke@xxxxxxxxxx>
> Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
>
> On Sat, Jan 27, 2024 at 5:34 PM wangyunjian <wangyunjian@xxxxxxxxxx>
> wrote:
> >
> > > > -----Original Message-----
> > > > From: Jason Wang [mailto:jasowang@xxxxxxxxxx]
> > > > Sent: Thursday, January 25, 2024 12:49 PM
> > > > To: wangyunjian <wangyunjian@xxxxxxxxxx>
> > > > Cc: mst@xxxxxxxxxx; willemdebruijn.kernel@xxxxxxxxx;
> > > > kuba@xxxxxxxxxx; davem@xxxxxxxxxxxxx; magnus.karlsson@xxxxxxxxx;
> > > > netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > > > kvm@xxxxxxxxxxxxxxx; virtualization@xxxxxxxxxxxxxxx; xudingke
> > > > <xudingke@xxxxxxxxxx>
> > > > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> > > >
> > > > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang
> > > <wangyunjian@xxxxxxxxxx>
> > > > wrote:
> > > > >
> > > > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > > > drivers, which can reduce CPU utilization on the xdp program.
> > > > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > > > >
> > > > > This patch tries to address this by:
> > > > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > > > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe
> empty.
> > > > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > > > - add tun_put_user_desc function to copy the Rx data to VM
> > > >
> > > > Code explains themselves, let's explain why you need to do this.
> > > >
> > > > 1) why you want to use peek_len
> > > > 2) for "vq's array", what does it mean?
> > > > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so
> > > > I guess you meant TX zerocopy instead of RX (as I don't see codes
> > > > for
> > > > RX?)
> > >
> > > OK, I agree and use TX zerocopy instead of RX zerocopy. I meant RX
> > > zerocopy from the view of vhost-net.
> > >
> > > >
> > > > A big question is how could you handle GSO packets from
> userspace/guests?
> > >
> > > Now by disabling VM's TSO and csum feature. XDP does not support GSO
> > > packets.
> > > However, this feature can be added once XDP supports it in the future.
> > >
> > > >
> > > > >
> > > > > Signed-off-by: Yunjian Wang <wangyunjian@xxxxxxxxxx>
> > > > > ---
> > > > > drivers/net/tun.c | 165
> > > > +++++++++++++++++++++++++++++++++++++++++++-
> > > > > drivers/vhost/net.c | 18 +++--
> > > > > 2 files changed, 176 insertions(+), 7 deletions(-)
> >
> > [...]
> >
> > > > >
> > > > > static int peek_head_len(struct vhost_net_virtqueue *rvq,
> > > > > struct sock
> > > > > *sk) {
> > > > > + struct socket *sock = sk->sk_socket;
> > > > > struct sk_buff *head;
> > > > > int len = 0;
> > > > > unsigned long flags;
> > > > >
> > > > > - if (rvq->rx_ring)
> > > > > - return vhost_net_buf_peek(rvq);
> > > > > + if (rvq->rx_ring) {
> > > > > + len = vhost_net_buf_peek(rvq);
> > > > > + if (likely(len))
> > > > > + return len;
> > > > > + }
> > > > > +
> > > > > + if (sock->ops->peek_len)
> > > > > + return sock->ops->peek_len(sock);
> > > >
> > > > What prevents you from reusing the ptr_ring here? Then you don't
> > > > need the above tricks.
> > >
> > > Thank you for your suggestion. I will consider how to reuse the ptr_ring.
> >
> > If ptr_ring is used to transfer xdp_descs, there is a problem: After
> > some xdp_descs are obtained through xsk_tx_peek_desc(), the descs may
> > fail to be added to ptr_ring. However, no API is available to
> > implement the rollback function.
>
> I don't understand, this issue seems to exist in the physical NIC as well?
>
> We get more descriptors than the free slots in the NIC ring.
>
> How did other NIC solve this issue?
Currently, physical NICs such as i40e, ice, ixgbe, igc, and mlx5 obtains
available NIC descriptors and then retrieve the same number of xsk
descriptors for processing.
Thanks
>
> Thanks
>
> >
> > Thanks
> >
> > >
> > > >
> > > > Thanks
> > > >
> > > >
> > > > >
> > > > > spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> > > > > head = skb_peek(&sk->sk_receive_queue);
> > > > > --
> > > > > 2.33.0
> > > > >
> >