Re: [PATCH net] udp: Allow kernel service to avoid udp socket rx queue

From: Paolo Abeni
Date: Thu Oct 04 2018 - 05:08:16 EST


Hi,

On Wed, 2018-10-03 at 19:48 +0100, David Howells wrote:
> There's a problem somewhere in skb_recv_udp() that doesn't decrease the
> sk_rmem_alloc counter when a packet is extracted from the receive queue by
> a kernel service.

If this is the case, it's really bad and needs an explicit fix. However,
it looks like sk_rmem_alloc is reclaimed by skb_recv_udp(), as it ends
up calling udp_rmem_release() on a successful dequeue.

udp_rmem_release() can delay sk_rmem_alloc reclaiming if the rx queue
is almost empty, due to commit
6b229cf77d683f634f0edd876c6d1015402303ad. Anyhow, I don't see how that
may affect this scenario: if only 1/4 of the maximum receive buffer
size is used, the next packet should always be able to land in the rx
queue, with the default sk_rcvbuf and every reasonable truesize.

> Further, there doesn't seem any point in having the socket buffer being
> added to the UDP socket's rx queue since the rxrpc's data_ready handler
> takes it straight back out again (more or less, there seem to be occasional
> hold ups there).

I would really try to avoid adding another indirect call in the
data-path, unless strictly needed (to avoid more RETPOLINE overhead for
all the other use-cases). If skipping the enqueuing altogether makes
sense (I guess so, mostly for performance reasons), I *think* you can
use the already existing encap_rcv hook, initializing it to the rxrpc
input function, and updating that function to pull the UDP header and,
if needed, initialize the pktinfo. Please see e.g. how l2tp uses it.

> Putting in some tracepoints show a significant delay occurring between packets
> coming in and thence being delivered to rxrpc:
>
> <idle>-0 [001] ..s2 67.631844: net_rtl8169_napi_rx: enp3s0 skb=07db0a32
> ...
> <idle>-0 [001] ..s4 68.292778: rxrpc_rx_packet: d5ce8d37:bdb93c60:00000002:09c7 00000006 00000000 02 20 ACK 660967981 skb=07db0a32
>
> The "660967981" is the time difference in nanoseconds between the sk_buff
> timestamp and the current time. It seems to match the time elapsed between
> the two trace lines reasonably well. I've seen anything up to about 4s.

Can you please provide more data? Specifically, can you please add:
* a perf probe in rxrpc_data_ready(), just after skb_recv_udp(),
reporting sk->sk_rmem_alloc and skb->truesize
* a perf probe in __udp_enqueue_schedule_skb(), just before the 'if
(rmem > sk->sk_rcvbuf)' test, again reporting sk->sk_rmem_alloc,
skb->truesize and sk->sk_rcvbuf

and then provide the 'perf record -g -e ...'/'perf script' output?

Thanks,

Paolo