Fwd: [RFC] tcp: Delay sending non-probes for RFC4821 mtu probing

From: Matt Mathis
Date: Wed Apr 21 2021 - 12:46:09 EST


(Resending in plain text mode)

Surely there is a way to adapt tcp_tso_should_defer(); it is trying to
solve a similar problem.

If I were to implement PLPMTUD today, I would more deeply entwine it
into TCP's support for TSO. For example, successfully deferring segments
would sometimes enable TSO and sometimes enable PLPMTUD.

But there is a deeper question: John Heffner and I invested a huge
amount of energy in trying to make PLPMTUD work for opportunistic
Jumbo discovery, only to discover that we had moved the problem down
to the device driver/NIC, where it isn't so readily solvable.

The driver needs to carve NIC buffer memory before it can communicate
with a switch (to either ask or measure the MTU), and once it has done
that it needs to either re-carve the memory or run with suboptimal
carving. Both of these are problematic.

There is also a problem that many link technologies will
non-deterministically deliver jumbo frames at greatly increased error
rates. This issue requires a long conversation on its own.

Thanks,
--MM--
The best way to predict the future is to create it. - Alan Kay

We must not tolerate intolerance;
however our response must be carefully measured:
too strong would be hypocritical and risks spiraling out of control;
too weak risks being mistaken for tacit approval.


On Wed, Apr 21, 2021 at 5:48 AM Neal Cardwell <ncardwell@xxxxxxxxxx> wrote:
>
> On Wed, Apr 21, 2021 at 6:21 AM Leonard Crestez <cdleonard@xxxxxxxxx> wrote:
> >
> > According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
> > in order to accumulate enough data" but Linux almost never does that.
> >
> > Linux waits for probe_size + (1 + retries) * mss_cache to be available
> > in the send buffer and if that condition is not met it will send anyway
> > using the current MSS. The feature can be made to work by sending very
> > large chunks of data from userspace (for example 128k) but for small writes
> > on fast links probes almost never happen.
> >
> > This patch tries to implement the "MAY" by adding an extra flag
> > "wait_data" to icsk_mtup which is set to 1 if a probe is possible but
> > insufficient data is available. Then data is held back in
> > tcp_write_xmit until a probe is sent, probing conditions are no longer
> > met, or 500ms pass.
> >
> > Signed-off-by: Leonard Crestez <cdleonard@xxxxxxxxx>
> >
> > ---
> > Documentation/networking/ip-sysctl.rst | 4 ++
> > include/net/inet_connection_sock.h | 7 +++-
> > include/net/netns/ipv4.h | 1 +
> > include/net/tcp.h | 2 +
> > net/ipv4/sysctl_net_ipv4.c | 7 ++++
> > net/ipv4/tcp_ipv4.c | 1 +
> > net/ipv4/tcp_output.c | 54 ++++++++++++++++++++++++--
> > 7 files changed, 71 insertions(+), 5 deletions(-)
> >
> > My tests are here: https://github.com/cdleonard/test-tcp-mtu-probing
> >
> > This patch makes the test pass quite reliably with
> > ICMP_BLACKHOLE=1 TCP_MTU_PROBING=1 IPERF_WINDOW=256k IPERF_LEN=8k while
> > before it only worked with much higher IPERF_LEN=256k
> >
> > In my loopback tests I also observed another issue when tcp_retries
> > increases because of SACKReorder. This makes the original problem worse
> > (since the retries amount factors into the buffer requirement) and seems
> > to be an unrelated issue. Maybe when loss happens due to MTU shrinkage the
> > sender sack logic is confused somehow?
> >
> > I know it's towards the end of the cycle but this is mostly just intended for
> > discussion.
>
> Thanks for raising the question of how to trigger PMTU probes more often!
>
> AFAICT this approach would cause unacceptable performance impacts by
> often injecting unnecessary 500ms delays when there is no need to do
> so.
>
> If the goal is to increase the frequency of PMTU probes, which seems
> like a valid goal, I would suggest that we rethink the Linux heuristic
> for triggering PMTU probes in the light of the fact that the loss
> detection mechanism is now RACK-TLP, which provides quick recovery in
> a much wider variety of scenarios.
>
> After all, https://tools.ietf.org/html/rfc4821#section-7.4 says:
>
> In addition, the timely loss detection algorithms in most protocols
> have pre-conditions that SHOULD be satisfied before sending a probe.
>
> And we know that the "timely loss detection algorithms" have advanced
> since this RFC was written in 2007.
>
> You mention:
> > Linux waits for probe_size + (1 + retries) * mss_cache to be available
>
> The code in question seems to be:
>
> size_needed = probe_size + (tp->reordering + 1) * tp->mss_cache;
>
> How about just changing this to:
>
> size_needed = probe_size + tp->mss_cache;
>
> The rationale would be that if that amount of data is available, then
> the sender can send one probe and one following current-mss-size
> packet. If the path MTU has not increased to allow the probe of size
> probe_size to pass through the network, then the following
> current-mss-size packet will likely pass through the network, generate
> a SACK, and trigger a RACK fast recovery 1/4*min_rtt later, when the
> RACK reorder timer fires.
>
> A secondary rationale for this heuristic would be: if the flow never
> accumulates roughly two packets worth of data, then does the flow
> really need a bigger packet size?
>
> IMHO, just reducing the size_needed seems far preferable to needlessly
> injecting 500ms delays.
>
> best,
> neal
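
To make the difference between the two size_needed formulas quoted above
concrete, here is a minimal standalone sketch. It is not kernel code:
struct probe_inputs is a hypothetical stand-in for the two tcp_sock fields
involved, and the numbers in the note after it (MSS of 1448 bytes,
reordering of 3, probe_size of 2896 bytes) are assumed values chosen for
illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the tcp_sock fields used by the heuristic. */
struct probe_inputs {
	uint32_t mss_cache;   /* current MSS, in bytes */
	uint32_t reordering;  /* reordering metric, in packets */
};

/* Heuristic quoted in the thread: probe plus (reordering + 1) MSS. */
uint32_t size_needed_current(uint32_t probe_size,
			     const struct probe_inputs *tp)
{
	assert(tp->mss_cache > 0);
	return probe_size + (tp->reordering + 1) * tp->mss_cache;
}

/* Neal's suggestion: probe plus a single current-MSS packet. */
uint32_t size_needed_proposed(uint32_t probe_size,
			      const struct probe_inputs *tp)
{
	assert(tp->mss_cache > 0);
	return probe_size + tp->mss_cache;
}
```

With the assumed values, the quoted formula asks for
2896 + 4 * 1448 = 8688 bytes of queued data, while the proposed one asks
for 2896 + 1448 = 4344 bytes. Under those assumptions a single 8k (8192
byte) write falls short of the first threshold but clears the second.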