Re: [RFC 1/3] tcp: Consider mtu probing for tcp_xmit_size_goal
From: Leonard Crestez
Date: Mon May 17 2021 - 09:42:44 EST
On 5/11/21 4:04 PM, Eric Dumazet wrote:
> On Tue, May 11, 2021 at 2:04 PM Leonard Crestez <cdleonard@xxxxxxxxx> wrote:
>> According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
>> in order to accumulate enough data" but linux almost never does that.
>> Linux checks for (probe_size + (1 + reorder) * mss_cache) bytes to be
>> available in the send buffer and if that condition is not met it will
>> send anyway using the current MSS. The feature can be made to work by
>> sending very large chunks of data from userspace (for example 128k) but
>> for small writes on fast links tcp mtu probes almost never happen.
>
> Why should they happen ?
> I am not sure the kernel should perform extra checks just because
> applications are not properly written.

My tests show that applications writing a few KB at a time almost never
trigger enough MTU probing to reach 9200. The reasons for this are very
difficult for me to understand.

It seems that only writing in very large chunks like 160k makes it
happen, much more than the size_needed calculated inside tcp_mtu_probe
(which is about 50k). This seems unreasonable. Ideally Linux should try
to accumulate enough data for a probe (as the RFC suggests) but at least
it should send probes that fit inside a single userspace write.
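
For reference, the data threshold quoted above, (probe_size + (1 + reorder) * mss_cache), can be sketched as a standalone helper. This is only an illustration: the function name and the sample numbers below are made up for the example and are not taken from the kernel or from the tests.

```c
#include <stdint.h>

/*
 * Bytes that must be queued before a probe is considered:
 * probe_size + (1 + reorder) * mss_cache.
 * probe_size_needed() is a hypothetical name used only in this sketch.
 */
uint32_t probe_size_needed(uint32_t probe_size, uint32_t reorder,
                           uint32_t mss_cache)
{
        return probe_size + (1 + reorder) * mss_cache;
}
```

With illustrative values probe_size = 4000, reorder = 3 and mss_cache = 1448 this gives 4000 + 4 * 1448 = 9792 bytes, so the threshold grows quickly with a larger reordering estimate or MSS.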

I dug a little deeper and what seems to happen is this:

 * size_needed is ~60k
 * once the head of the queue reaches size_needed, tcp_push_one is
   called, which sends everything, ignoring MTU probing
 * size_needed is reached again and tcp_push_pending_frames is called.
   At this point the cwnd has shrunk below 11 (due to the previous
   burst), so probing is skipped again in favor of just sending in
   mss-sized chunks.

This happens repeatedly: a sender-limited app performing periodic 128k
writes will see MSS stuck below MTU.
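
The two gating conditions in the sequence above can be modelled roughly as follows. This is a simplified sketch of the behaviour as described in this mail, not the kernel's tcp_mtu_probe; the enum, the function name and the macro are hypothetical, and the threshold of 11 is taken from the cwnd observation above.

```c
#include <stdint.h>

/* Hypothetical outcome names for this sketch; not kernel identifiers. */
enum probe_outcome {
        PROBE_SKIPPED_NO_DATA,  /* fewer than size_needed bytes queued */
        PROBE_SKIPPED_CWND,     /* congestion window below the threshold */
        PROBE_POSSIBLE,         /* both conditions described above hold */
};

/* Threshold from the "cwnd has shrunk below 11" observation above. */
#define MIN_CWND_FOR_PROBE 11

enum probe_outcome probe_decision(uint32_t queued_bytes,
                                  uint32_t size_needed,
                                  uint32_t cwnd)
{
        if (queued_bytes < size_needed)
                return PROBE_SKIPPED_NO_DATA;
        if (cwnd < MIN_CWND_FOR_PROBE)
                return PROBE_SKIPPED_CWND;
        return PROBE_POSSIBLE;
}
```

In this model a small write never queues enough data, and a write that does queue enough data is rejected if the previous burst has left cwnd under the threshold.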

I don't understand the push_one logic or why it completely skips MTU
probing. It seems like an optimization which doesn't take RFC4821 into
account.

>> This patch tries to take mtu probe into account in tcp_xmit_size_goal, a
>> function which otherwise attempts to accumulate a packet suitable for
>> TSO. No delays are introduced beyond existing autocork heuristics.
>
> MTU probing should not be attempted for every write().
> This belongs to some kind of slow path, once in a while.

MTU probing is only attempted every 10 minutes, but once a probe is
pending it does have a slight impact on every write. This is already the
case: tcp_write_xmit calls tcp_mtu_probe almost every time.
I had an idea for reducing the overhead in tcp_size_needed but it turns
out I was indeed mistaken about what this function does. I thought it
returned ~mss when all GSO is disabled but this is not so.

>>  static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>>                                         int large_allowed)
>>  {
>> +        struct inet_connection_sock *icsk = inet_csk(sk);
>>          struct tcp_sock *tp = tcp_sk(sk);
>>          u32 new_size_goal, size_goal;
>>
>>          if (!large_allowed)
>>                  return mss_now;
>>
>> @@ -932,11 +933,19 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>>                  tp->gso_segs = min_t(u16, new_size_goal / mss_now,
>>                                       sk->sk_gso_max_segs);
>>                  size_goal = tp->gso_segs * mss_now;
>>          }
>>
>> -        return max(size_goal, mss_now);
>> +        size_goal = max(size_goal, mss_now);
>> +
>> +        if (unlikely(icsk->icsk_mtup.wait_data)) {
>> +                int mtu_probe_size_needed = tcp_mtu_probe_size_needed(sk, NULL);
>> +                if (mtu_probe_size_needed > 0)
>> +                        size_goal = max(size_goal, (u32)mtu_probe_size_needed);
>> +        }
>
> I think you are mistaken.
> This function usually returns 64KB depending on MSS.
> Have you really tested this part ?

I assumed that with all GSO features disabled this function returns one
MSS, but this is not true. My patch had a positive effect just because I
made tcp_mtu_probe return "0" instead of "-1" if not enough data is
queued.

I don't fully understand the implications of that change though. If
tcp_mtu_probe returns zero what guarantee is there that data will
eventually be sent even if no further userspace writes happen?
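
A minimal sketch of why the return value matters, assuming the convention documented above tcp_mtu_probe in net/ipv4/tcp_output.c (0 means wait to probe, 1 means a probe was sent, -1 means otherwise). The helper name is hypothetical and this models only the caller's decision, not tcp_write_xmit itself.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: given the probe result, does the caller go on
 * to send queued data with the current MSS on this pass?
 */
bool continue_with_normal_send(int probe_result)
{
        if (probe_result == 0)
                return false;   /* wait: nothing is sent on this pass */
        return true;            /* -1 (no probe) or 1 (probe sent) */
}
```

Under that assumption, returning 0 when too little data is queued holds the data back until something else triggers another transmit attempt, which is exactly the guarantee being asked about.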
I'd welcome any suggestions.
--
Regards,
Leonard