Re: [PATCH net-next] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy

From: Jonathan Maxwell
Date: Mon Jul 02 2018 - 23:16:30 EST


On Tue, Jul 3, 2018 at 1:00 PM, Neal Cardwell <ncardwell@xxxxxxxxxx> wrote:
> On Mon, Jul 2, 2018 at 9:18 PM Jon Maxwell <jmaxwell37@xxxxxxxxx> wrote:
>>
>> Every time the TCP retransmission timer fires. It checks to see if there is a
>> timeout before scheduling the next retransmit timer. The retransmit interval
>> between each retransmission increases exponentially. The issue is that in order
>> for the timeout to occur the retransmit timer needs to fire again. If the user
>> timeout check happens after the 9th retransmit for example. It needs to wait for
>> the 10th retransmit timer to fire in order to evaluate whether a timeout has
>> occurred or not. If the interval is large enough then the timeout will be
>> inaccurate.
>>
>> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>>
>> 1st retransmit:
>>
>> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Last retransmit:
>>
>> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Timeout:
>>
>> send: Connection timed out
>> Sun Jul 1 22:25:34 EDT 2018
>>
>> We can see that last retransmit took ~7 seconds. Which pushed the total
>> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
>> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>>
>> Fix this by recalculating the last retransmit interval so that it fires when
>> the timeout should occur. Only implement when icsk->icsk_user_timeout is set.
>>
>> Test results with the patch is the expected 10 second timeout:
>>
>> 1st retransmit:
>>
>> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Last retransmit:
>>
>> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Timeout:
>>
>> send: Connection timed out
>> Mon Jul 2 01:38:09 EDT 2018
>>
>> Signed-off-by: Jon Maxwell <jmaxwell37@xxxxxxxxx>
>> ---
>> net/ipv4/tcp_timer.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index 3b3611729928..94491a481722 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -407,6 +407,7 @@ void tcp_retransmit_timer(struct sock *sk)
>> struct tcp_sock *tp = tcp_sk(sk);
>> struct net *net = sock_net(sk);
>> struct inet_connection_sock *icsk = inet_csk(sk);
>> + __u32 time_remaining = 0;
>>
>> if (tp->fastopen_rsk) {
>> WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
>> @@ -535,6 +536,12 @@ void tcp_retransmit_timer(struct sock *sk)
>> /* Use normal (exponential) backoff */
>> icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>> }
>> + if (icsk->icsk_user_timeout) {
>> + time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
>> + (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
>> + if (time_remaining < icsk->icsk_rto)
>> + icsk->icsk_rto = time_remaining;
>> + }
>
> Thanks, a more precise user timeout sounds nice. A couple thoughts:
>
> (a) The icsk->icsk_rto is in jiffies, and the time_remaining is in
> msecs, so it looks like there is a units mismatch here in the
> comparisons and assignment.
>
> (b) It also seems like the time_remaining could be negative, because
> (a) the icsk_user_timeout is not involved in the baseline RTO
> calculation, so that perhaps the first RTO to fire might be beyond the
> icsk_user_timeout AFAIK, and (b) if the machine is very busy then the
> timer handler can be delayed beyond the targeted icsk_user_timeout.
> But time_remaining is a __u32, and icsk->icsk_rto is also a __u32, so
> it seems like a negative number in time_remaining would usually be
> treated as a very large unsigned positive number in this comparison:
>
> + if (time_remaining < icsk->icsk_rto)
>
> (c) If the user timeout is changed between RTO expirations to push the
> user timeout further in the future, then it seems like this commit
> will have side effects that left the icsk->icsk_rto in a weird state
> that does not do the expected exponential backoff correctly.
>
> (d) There are also wrapping issues to watch out for, since the
> tcp_time_stamp(tcp_sk(sk)) and tcp_sk(sk)->retrans_stamp are
> milliseconds, which will wrap every 49 days or so. Seems like the code
> is OK in that respect.
>
> (e) It also might be nice to put this logic in a helper, rather than
> growing the body of tcp_retransmit_timer().
>
> What about something like (pseudocode):
>
> --
>
> static __u32 tcp_clamp_rto_to_user_timeout(sk):
> rto = icsk->icsk_rto;
> if (!icsk->icsk_user_timeout)
> return rto;
> elapsed = tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp;
> user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
> if (elapsed >= user_timeout)
> rto = 1; /* user timeout has passed; fire ASAP */
> else
> rto = min(rto, msecs_to_jiffies(user_timeout - elapsed));
> return rto;
>
> tcp_retransmit_timer():
> ...
> rto = tcp_clamp_rto_to_user_timeout(sk);
> inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX);
>

Thanks Neal, that looks like a good idea. Let me test that out in my reproducer.

Regards

Jon

> --
>
> neal