Re: [PATCH net] udp: Allow kernel service to avoid udp socket rx queue

From: David Howells
Date: Thu Oct 04 2018 - 05:37:47 EST


Paolo Abeni <pabeni@xxxxxxxxxx> wrote:

> > There's a problem somewhere in skb_recv_udp(): it doesn't decrease the
> > sk_rmem_alloc counter when a packet is extracted from the receive queue
> > by a kernel service.
>
> If this is the case, it's really bad and needs an explicit fix. However
> it looks like sk_rmem_alloc is reclaimed by skb_recv_udp(), as it ends
> up calling udp_rmem_release() on successful dequeue.

It certainly *looks* like it should do that, but nonetheless the tracepoint I
put in shows the counter going up and up. I can try putting a tracepoint next
to the subtraction and see what that shows.
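
Something like this, say - an untested sketch, and the exact shape of
udp_rmem_release() may differ between trees:

	/* In udp_rmem_release(), right next to the subtraction, so the
	 * reclaim can be correlated with what rxrpc_data_ready() sees.
	 */
	trace_printk("udp_rmem_release: sk=%p size=%d rmem=%d\n",
		     sk, size, atomic_read(&sk->sk_rmem_alloc));
	atomic_sub(size, &sk->sk_rmem_alloc);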

> > Further, there doesn't seem to be any point in adding the socket buffer
> > to the UDP socket's rx queue, since rxrpc's data_ready handler takes it
> > straight back out again (more or less; there seem to be occasional
> > hold-ups there).
>
> I would really try to avoid adding another indirect call in the
> data-path, unless strictly needed (to avoid more RETPOLINE overhead for
> all the other use-cases). If skipping the enqueuing altogether makes
> sense (I guess so, mostly for performance reasons), I *think* you can
> use the already existing encap_rcv hook, initializing it to the rxrpc
> input function and updating that function to pull the UDP header and,
> if needed, initialize the pktinfo. Please see e.g. the l2tp usage.

I looked at that, but it seems that a global conditional (the udp_encap_needed
static key) is required to enable it - presumably for performance reasons. I
presume I would need to:

(1) Allocate a new UDP_ENCAP_* flag.

(2) Replicate at least some of what gets done between the encap check in
udp_queue_rcv_skb() and the call of __udp_enqueue_schedule_skb(), such as
running the socket filter (sk_filter_trim_cap()).

I'm not sure whether I need to call things like ipv4_pktinfo_prepare(),
sock_rps_save_rxhash(), sk_mark_napi_id() or sk_incoming_cpu_update() - are
they necessary for the UDP socket to work correctly? A sketch of what I have
in mind follows.
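
To make sure we're talking about the same thing, here's roughly what I'm
thinking of, based on the l2tp pattern. UDP_ENCAP_RXRPC, rxrpc_encap_rcv()
and rxrpc_input_packet() are hypothetical names, not code that exists yet:

	#include <linux/udp.h>
	#include <net/udp_tunnel.h>

	/* Hypothetical encap_rcv handler.  The skb still carries its UDP
	 * header at this point, so pull it off before handing the packet
	 * to the rxrpc input machinery.
	 */
	static int rxrpc_encap_rcv(struct sock *sk, struct sk_buff *skb)
	{
		if (!pskb_may_pull(skb, sizeof(struct udphdr)))
			goto drop;
		__skb_pull(skb, sizeof(struct udphdr));

		rxrpc_input_packet(sk, skb);
		return 0;	/* <= 0 tells UDP the skb was consumed */
	drop:
		kfree_skb(skb);
		return 0;
	}

	static void rxrpc_setup_encap(struct net *net, struct socket *sock)
	{
		struct udp_tunnel_sock_cfg cfg = {
			.encap_type	= UDP_ENCAP_RXRPC, /* new flag */
			.encap_rcv	= rxrpc_encap_rcv,
		};

		setup_udp_tunnel_sock(net, sock, &cfg);
	}

If I read it correctly, setup_udp_tunnel_sock() also enables the
udp_encap_needed static key, which I take to be the global conditional
mentioned above.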

> > Putting in some tracepoints shows a significant delay occurring between
> > packets coming in and thence being delivered to rxrpc:
> >
> > <idle>-0 [001] ..s2 67.631844: net_rtl8169_napi_rx: enp3s0 skb=07db0a32
> > ...
> > <idle>-0 [001] ..s4 68.292778: rxrpc_rx_packet: d5ce8d37:bdb93c60:00000002:09c7 00000006 00000000 02 20 ACK 660967981 skb=07db0a32
> >
> > The "660967981" is the time difference in nanoseconds between the sk_buff
> > timestamp and the current time. It seems to match the time elapsed between
> > the two trace lines reasonably well. I've seen anything up to about 4s.
>
> Can you please provide more data?

I can give you a whole trace if you like.

> Specifically, can you please add:
> * a perf probe in rxrpc_data_ready() just after skb_recv_udp(),
>   reporting sk->sk_rmem_alloc and skb->truesize
> * a perf probe in __udp_enqueue_schedule_skb() just before the
>   'if (rmem > sk->sk_rcvbuf)' test, again reporting sk->sk_rmem_alloc,
>   skb->truesize and sk->sk_rcvbuf
> and then provide the "perf record -g -e ..." / "perf script" output?

Can't this be done by putting tracepoints there instead? I don't know how to
do the perf stuff. What does that get you that can't be obtained with a
tracepoint?
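
(For reference, in case it helps anyone else following along: if I've
understood the perf-probe approach correctly, it would be something like
the following, assuming debuginfo is available - sk_rmem_alloc is an
atomic_t, hence the .counter:

	# perf probe -a '__udp_enqueue_schedule_skb rmem=sk->sk_rmem_alloc.counter size=skb->truesize rcvbuf=sk->sk_rcvbuf'
	# perf record -g -e probe:__udp_enqueue_schedule_skb -a -- sleep 10
	# perf script

and presumably something similar for rxrpc_data_ready(), with -m rxrpc to
point perf probe at the module.)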

Thanks,
David