[PATCH net-next 00/37] rxrpc: Implement jumbo DATA transmission and RACK-TLP

From: David Howells
Date: Mon Dec 02 2024 - 12:19:06 EST


Here's a series of patches to implement two main features:

(1) The transmission of jumbo data packets whereby several DATA packets of
a particular size can be glued together into a single UDP packet,
allowing us to make use of larger MTU sizes. The basic jumbo
subpacket capacity is 1412 bytes (RXRPC_JUMBO_DATALEN) and, say, an
MTU of 8192 allows five of them to be transmitted as one.

An alternative (and possibly more efficient way) would be to
expand/shrink the capacity of each DATA packet to match the MTU and
thus save on header and tail-gap overhead, but the Rx protocol does
not provide a mechanism for splitting the data - especially as the
transported data is encrypted per-packet - and so UDP fragmentation
would be the only way to handle this.

In fact, in the future, AF_RXRPC also needs to look at shrinking the
packet size where the MTU is smaller - for instance in the case of
being carried by IPv6 over wifi where there isn't capacity for a 1412
byte capacity.

(2) RACK-TLP to manage packet loss and retransmission in conjunction with
the congestion control algorithm.

These allow for better data throughput and work towards being able to have
larger transmission windows.

To this end, the following changes are also made:

(1) Use a single large array of kvec structs for the I/O thread rather
than having one per transmission buffer. We need a much bigger
collection of kvecs for ping padding

(2) Implement path-MTU probing by sending padded PING ACK packets and
monitoring for PING RESPONSE ACKs. The pmtud value determined is used
to configure the construction of jumbo DATA packets.

(3) The transmission queue is changed from a linked list of transmission
buffer structs to a linked list of transmission-queue structs, each of
which points to either 32 or 64 transmission buffers (depending on cpu
word size) and various bits of metadata are concentrated in the queue
structs rather than the buffers to make better use of the cpu cache.

(4) SACK data is stored in the transmission-queue structures in batches of
32 or 64 making it faster to process rather than being spread amongst
all the individual packet buffers.

(5) Don't change the DF flag on the UDP socket unless we need to - and
basically only enable it for path-MTU probing.

There are also some additional bits:

(1) Fix the handling of connection aborts to poke the aborted connections.

(2) Don't set the MORE-PACKETS Rx header flag on the wire. No one
actually checks it and it is, in any case, generated inconsistently
between implementations.

(3) Request an ACK when, during call transmission, there's a stall in the
app generating the data to be transmitted.

(4) Fix attention starvation in the I/O thread by making sure we go
through all outstanding events rather than returning to the beginning
of the check cycle after any time we process an event.

(5) Don't use the skbuff timestamp in the calculation of timeouts and RTT
as we really should include local processing time in that too.
Further, getting receive skbuff timestamps may be expensive.

(6) Make RTT tracking per call with the saving of the value between calls,
even within the same connection channel. The initial call timeout
starts off large to allow the server time to set up its state before
the initial reply.

(7) Don't allocate txbuf structs for ACK packets, but rather use page
frags and MSG_SPLICE_PAGES.

(8) Use irq-disabling locks for interactions between app threads and I/O
threads so that the I/O thread doesn't get help up.

(9) Make rxrpc set the REQUEST-ACK flag on an outgoing packet when cwnd is
at RXRPC_MIN_CWND (currently 4), not at 2 which it can never reach.

(10) Add some tracing bits and pieces (including displaying the userStatus
field in an ACK header) and some more stats counters (including
different sizes of jumbo packets sent/received).

The patches can also be found on this branch:

http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=rxrpc-iothread

David

Link: https://lore.kernel.org/r/20240306000655.1100294-1-dhowells@xxxxxxxxxx/ [1]

David Howells (37):
rxrpc: Fix handling of received connection abort
rxrpc: Use umin() and umax() rather than min_t()/max_t() where
possible
rxrpc: Clean up Tx header flags generation handling
rxrpc: Don't set the MORE-PACKETS rxrpc wire header flag
rxrpc: Show stats counter for received reason-0 ACKs
rxrpc: Request an ACK on impending Tx stall
rxrpc: Use a large kvec[] in rxrpc_local rather than every rxrpc_txbuf
rxrpc: Implement path-MTU probing using padded PING ACKs (RFC8899)
rxrpc: Separate the packet length from the data length in rxrpc_txbuf
rxrpc: Prepare to be able to send jumbo DATA packets
rxrpc: Add a tracepoint to show variables pertinent to jumbo packet
size
rxrpc: Fix CPU time starvation in I/O thread
rxrpc: Fix injection of packet loss
rxrpc: Only set DF=1 on initial DATA transmission
rxrpc: Timestamp DATA packets before transmitting them
rxrpc: Implement progressive transmission queue struct
rxrpc: call->acks_hard_ack is now the same call->tx_bottom, so remove
it
rxrpc: Replace call->acks_first_seq with tracking of the hard ACK
point
rxrpc: Display stats about jumbo packets transmitted and received
rxrpc: Adjust names and types of congestion-related fields
rxrpc: Use the new rxrpc_tx_queue struct to more efficiently process
ACKs
rxrpc: Store the DATA serial in the txqueue and use this in RTT calc
rxrpc: Don't use received skbuff timestamps
rxrpc: Generate rtt_min
rxrpc: Adjust the rxrpc_rtt_rx tracepoint
rxrpc: Display userStatus in rxrpc_rx_ack trace
rxrpc: Fix the calculation and use of RTO
rxrpc: Fix initial resend timeout
rxrpc: Send jumbo DATA packets
rxrpc: Don't allocate a txbuf for an ACK transmission
rxrpc: Use irq-disabling spinlocks between app and I/O thread
rxrpc: Tidy up the ACK parsing a bit
rxrpc: Add a reason indicator to the tx_data tracepoint
rxrpc: Add a reason indicator to the tx_ack tracepoint
rxrpc: Manage RTT per-call rather than per-peer
rxrpc: Fix request for an ACK when cwnd is minimum
rxrpc: Implement RACK/TLP to deal with transmission stalls [RFC8985]

include/trace/events/rxrpc.h | 878 ++++++++++++++++++++++++++++++-----
lib/win_minmax.c | 1 +
net/rxrpc/Makefile | 1 +
net/rxrpc/af_rxrpc.c | 4 +-
net/rxrpc/ar-internal.h | 339 +++++++++++---
net/rxrpc/call_accept.c | 22 +-
net/rxrpc/call_event.c | 385 ++++++++-------
net/rxrpc/call_object.c | 67 +--
net/rxrpc/conn_client.c | 26 +-
net/rxrpc/conn_event.c | 38 +-
net/rxrpc/conn_object.c | 14 +-
net/rxrpc/input.c | 706 +++++++++++++++++-----------
net/rxrpc/input_rack.c | 422 +++++++++++++++++
net/rxrpc/insecure.c | 5 +-
net/rxrpc/io_thread.c | 109 ++---
net/rxrpc/local_object.c | 3 -
net/rxrpc/misc.c | 4 +-
net/rxrpc/output.c | 557 ++++++++++++++--------
net/rxrpc/peer_event.c | 112 ++++-
net/rxrpc/peer_object.c | 30 +-
net/rxrpc/proc.c | 58 ++-
net/rxrpc/protocol.h | 13 +-
net/rxrpc/recvmsg.c | 18 +-
net/rxrpc/rtt.c | 103 ++--
net/rxrpc/rxkad.c | 59 ++-
net/rxrpc/rxperf.c | 2 +-
net/rxrpc/security.c | 4 +-
net/rxrpc/sendmsg.c | 88 +++-
net/rxrpc/sysctl.c | 6 +-
net/rxrpc/txbuf.c | 127 +----
30 files changed, 2968 insertions(+), 1233 deletions(-)
create mode 100644 net/rxrpc/input_rack.c