Re: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing

From: Neal Cardwell
Date: Mon Apr 26 2021 - 12:00:08 EST


On Sun, Apr 25, 2021 at 10:34 PM Leonard Crestez <lcrestez@xxxxxxxxxxxxx> wrote:
>
> On 4/21/21 3:47 PM, Neal Cardwell wrote:
> > On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@xxxxxxxxx> wrote:
> >>
> >> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
> >> in order to accumulate enough data" but linux almost never does that.
> >>
> >> Linux waits for probe_size + (1 + retries) * mss_cache to be available
> >> in the send buffer and if that condition is not met it will send anyway
> >> using the current MSS. The feature can be made to work by sending very
> >> large chunks of data from userspace (for example 128k) but for small writes
> >> on fast links probes almost never happen.
> >>
> >> This patch tries to implement the "MAY" by adding an extra flag
> >> "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
> >> insufficient data is available. Then data is held back in
> >> tcp_write_xmit until a probe is sent, probing conditions are no longer
> >> met, or 500ms pass.
> >>
> >> Signed-off-by: Leonard Crestez <cdleonard@xxxxxxxxx>
> >>
> >> ---
> >> Documentation/networking/ip-sysctl.rst | 4 ++
> >> include/net/inet_connection_sock.h | 7 +++-
> >> include/net/netns/ipv4.h | 1 +
> >> include/net/tcp.h | 2 +
> >> net/ipv4/sysctl_net_ipv4.c | 7 ++++
> >> net/ipv4/tcp_ipv4.c | 1 +
> >> net/ipv4/tcp_output.c | 54 ++++++++++++++++++++++++--
> >> 7 files changed, 71 insertions(+), 5 deletions(-)
> >>
> >> My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
> >>
> >> This patch makes the test pass quite reliably with
> >> ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
> >> before it only worked with much higher IPERF_LEN=256k
> >>
> >> In my loopback tests I also observed another issue when tcp_retries
> >> increases because of SACKReorder. This makes the original problem worse
> >> (since the retries amount factors in buffer requirement) and seems to be
> >> unrelated issue. Maybe when loss happens due to MTU shrinkage the sender
> >> sack logic is confused somehow?
> >>
> >> I know it's towards the end of the cycle but this is mostly just intended for
> >> discussion.
> >
> > Thanks for raising the question of how to trigger PMTU probes more often!
> >
> > AFAICT this approach would cause unacceptable performance impacts by
> > often injecting unnecessary 500ms delays when there is no need to do
> > so.
> >
> > If the goal is to increase the frequency of PMTU probes, which seems
> > like a valid goal, I would suggest that we rethink the Linux heuristic
> > for triggering PMTU probes in the light of the fact that the loss
> > detection mechanism is now RACK-TLP, which provides quick recovery in
> > a much wider variety of scenarios.
>
> > After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:
> >
> > In addition, the timely loss detection algorithms in most protocols
> > have pre-conditions that SHOULD be satisfied before sending a probe.
> >
> > And we know that the "timely loss detection algorithms" have advanced
> > since this RFC was written in 2007. >
> > You mention:
> >> Linux waits for probe_size + (1 + retries) * mss_cache to be available
> >
> > The code in question seems to be:
> >
> > size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>
> As far as I understand this is meant to work with classical retransmit:
> if 3 dupacks are received then the first segment is considered lost and
> probe success or failure is can determine within roughly 1*rtt.

Yes, that is my sense as well.

> RACK
> marks segments as lost based on echoed timestamps so it doesn't need
> multiple segments. The minimum time interval is only a little higher
> (5/4 rtt). Is this correct?

That's basically the case, though RACK doesn't even require echoed timestamps.

> > How about just changing this to:
> >
> > size_needed = probe_size + tp->mss_cache;
> >
> > The rationale would be that if that amount of data is available, then
> > the sender can send one probe and one following current-mss-size
> > packet. If the path MTU has not increased to allow the probe of size
> > probe_size to pass through the network, then the following
> > current-mss-size packet will likely pass through the network, generate
> > a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
> > RACK reorder timer fires.
>
> This appears to almost work except it stalls after a while. I spend some
> time investigating it and it seems that cwnd is shrunk on mss increases
> and does not go back up. This causes probes to be skipped because of a
> "snd_cwnd < 11" condition.
>
> I don't undestand where that magical "11" comes from, could that be
> shrunk. Maybe it's meant to only send probes when the cwnd is above the
> default of 10? Then maybe mtu_probe_success shouldn't shrink mss below
> what is required for an additional probe, or at least round-up.
>
> The shrinkage of cwnd is a problem with this "short probes" approach
> because tcp_is_cwnd_limited returns false because tp->max_packets_out is
> smaller (4). With longer probes tp->max_packets_out is larger (6) so
> tcp_is_cwnd_limited returns true even for a cwnd of 10.
>
> I'm testing using namespace-to-namespace loopback so my delays are close
> to zero. I tried to introduce an artificial delay of 30ms (using tc
> netem) and it works but 20ms does not.

I agree the magic 11 seems outdated and unnecessarily high, given RACK-TLP.

I think it would be fine to change the magic 11 to a magic
(TCP_FASTRETRANS_THRESH+1), aka 3+1=4:

- tp->snd_cwnd < 11 ||
+ p->snd_cwnd < (TCP_FASTRETRANS_THRESH + 1) ||

As long as the cwnd is >= TCP_FASTRETRANS_THRESH+1 then the sender
should usually be able to send the 1 probe packet and then 3
additional packets beyond the probe, and in the common case (with no
reordering) then with failed probes this should allow the sender to
quickly receive 3 SACKed segments and enter fast recovery quickly.
Even if the sender doesn't have 3 additional packets, or if reordering
has been detected, then RACK-TLP should be able to start recovery
quickly (5/4*RTT if there is at least one SACK, or 2*RTT for a TLP if
there is no SACK).

> > A secondary rationale for this heuristic would be: if the flow never
> > accumulates roughly two packets worth of data, then does the flow
> > really need a bigger packet size?
>
> The problem is that "accumulating sufficient data" is an extremely fuzzy
> concept. In particular it seems that at the same traffic level
> performing shorter writes from userspace (2kb instead of 64k) can
> prevent mtu probing entirely and this is unreasonable.

Something like your autocorking-enhanced-PMTU patch sounds like a
reasonable way to deal with this.

thanks,
neal