Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels

From: John Heffner
Date: Thu Sep 29 2005 - 11:05:47 EST


On Thursday 29 September 2005 11:17 am, Alexey Kuznetsov wrote:
> Hello!
>
> > >Anyway, ignoring this puzzle, the following patch for 2.4 should help.
> > >
> > >
> > >--- net/ipv4/tcp_input.c.orig 2003-02-20 20:38:39.000000000 +0300
> > >+++ net/ipv4/tcp_input.c 2005-09-02 22:28:00.845952888 +0400
> > >@@ -343,8 +343,6 @@
> > > app_win -= tp->ack.rcv_mss;
> > > app_win = max(app_win, 2U*tp->advmss);
> > >
> > >- if (!ofo_win)
> > >- tp->window_clamp = min(tp->window_clamp, app_win);
> > > tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
> > > }
> > >}
> >
> > I'm very happy to report that the above patch, applied to 2.6.12.6, seems
> > to have cured the TCP window problem we were experiencing.
>
> Good. I think the patch is to be applied to all mainstream kernels.

Has anyone looked at the patch I sent out on Sept 9? It goes a few steps
further, addressing some additional problems. Original message below.

Thanks,
-John

-----

This is a patch for discussion addressing some receive buffer growing issues.  
This is partially related to the thread "Possible BUG in IPv4 TCP window
handling..." last week.

Specifically it addresses the problem of an interaction between rcvbuf
moderation (receiver autotuning) and rcv_ssthresh.  The problem occurs when
sending small packets to a receiver with a larger MTU.  (A very common case I
have is a host with a 1500 byte MTU sending to a host with a 9k MTU.)  In
such a case, the rcv_ssthresh code is targeting a window size corresponding
to filling up the current rcvbuf, not taking into account that the new rcvbuf
moderation may increase the rcvbuf size.

One hunk makes rcv_ssthresh use tcp_rmem[2] as the size target rather than
rcvbuf.  The other changes the behavior when it overflows its memory bounds
with in-order data so that it tries to grow rcvbuf (the same as with
out-of-order data).

These changes should help my problem of mixed MTUs, and should also help the
case from last week's thread I think.  (In both cases though you still need
tcp_rmem[2] to be set much larger than the TCP window.)  One question is if
this is too aggressive at trying to increase rcvbuf if it's under memory
stress.

  -John


Signed-off-by: John Heffner <jheffner@xxxxxxx>
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -233,7 +233,7 @@ static int __tcp_grow_window(const struc
{
/* Optimize this! */
int truesize = tcp_win_from_space(skb->truesize)/2;
- int window = tcp_full_space(sk)/2;
+ int window = tcp_win_from_space(sysctl_tcp_rmem[2])/2;

while (tp->rcv_ssthresh <= window) {
if (truesize <= skb->len)
@@ -326,39 +326,18 @@ static void tcp_init_buffer_space(struct
static void tcp_clamp_window(struct sock *sk, struct tcp_sock *tp)
{
struct inet_connection_sock *icsk = inet_csk(sk);
- struct sk_buff *skb;
- unsigned int app_win = tp->rcv_nxt - tp->copied_seq;
- int ofo_win = 0;

icsk->icsk_ack.quick = 0;

- skb_queue_walk(&tp->out_of_order_queue, skb) {
- ofo_win += skb->len;
+ if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
+ !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
+ !tcp_memory_pressure &&
+ atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+ sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
+ sysctl_tcp_rmem[2]);
}
-
- /* If overcommit is due to out of order segments,
- * do not clamp window. Try to expand rcvbuf instead.
- */
- if (ofo_win) {
- if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
- !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
- !tcp_memory_pressure &&
- atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0])
- sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
- sysctl_tcp_rmem[2]);
- }
- if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) {
- app_win += ofo_win;
- if (atomic_read(&sk->sk_rmem_alloc) >= 2 * sk->sk_rcvbuf)
- app_win >>= 1;
- if (app_win > icsk->icsk_ack.rcv_mss)
- app_win -= icsk->icsk_ack.rcv_mss;
- app_win = max(app_win, 2U*tp->advmss);
-
- if (!ofo_win)
- tp->window_clamp = min(tp->window_clamp, app_win);
+ if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
- }
}

/* Receiver "autotuning" code.