On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@xxxxxxxxx> wrote:
According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
in order to accumulate enough data" but linux almost never does that.
Linux waits for probe_size + (1 + retries) * mss_cache to be available
in the send buffer and if that condition is not met it will send anyway
using the current MSS. The feature can be made to work by sending very
large chunks of data from userspace (for example 128k) but for small writes
on fast links probes almost never happen.
This patch tries to implement the "MAY" by adding an extra flag
"wait_data" to icsk_mtup which is set to 1 if a probe is possible but
insufficient data is available. Then data is held back in
tcp_write_xmit until a probe is sent, probing conditions are no longer
met, or 500ms pass.
Signed-off-by: Leonard Crestez <cdleonard@xxxxxxxxx>
---
Documentation/networking/ip-sysctl.rst | 4 ++
include/net/inet_connection_sock.h | 7 +++-
include/net/netns/ipv4.h | 1 +
include/net/tcp.h | 2 +
net/ipv4/sysctl_net_ipv4.c | 7 ++++
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/tcp_output.c | 54 ++++++++++++++++++++++++--
7 files changed, 71 insertions(+), 5 deletions(-)
My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
This patch makes the test pass quite reliably with
ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
before it only worked with much higher IPERF_LEN=256k
In my loopback tests I also observed another issue when tcp_retries
increases because of SACKReorder. This makes the original problem worse
(since the retries amount factors in buffer requirement) and seems to be
unrelated issue. Maybe when loss happens due to MTU shrinkage the sender
sack logic is confused somehow?
I know it's towards the end of the cycle but this is mostly just intended for
discussion.
Thanks for raising the question of how to trigger PMTU probes more often!
AFAICT this approach would cause unacceptable performance impacts by
often injecting unnecessary 500ms delays when there is no need to do
so.
If the goal is to increase the frequency of PMTU probes, which seems
like a valid goal, I would suggest that we rethink the Linux heuristic
for triggering PMTU probes in the light of the fact that the loss
detection mechanism is now RACK-TLP, which provides quick recovery in
a much wider variety of scenarios.
After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:
In addition, the timely loss detection algorithms in most protocols
have pre-conditions that SHOULD be satisfied before sending a probe.
And we know that the "timely loss detection algorithms" have advanced
since this RFC was written in 2007. >
You mention:
Linux waits for probe_size + (1 + retries) * mss_cache to be available
The code in question seems to be:
size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
How about just changing this to:
size_needed = probe_size + tp->mss_cache;
The rationale would be that if that amount of data is available, then
the sender can send one probe and one following current-mss-size
packet. If the path MTU has not increased to allow the probe of size
probe_size to pass through the network, then the following
current-mss-size packet will likely pass through the network, generate
a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
RACK reorder timer fires.
A secondary rationale for this heuristic would be: if the flow never
accumulates roughly two packets worth of data, then does the flow
really need a bigger packet size?