Re: [net-next PATCH] tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
From: Alexander Duyck
Date: Sat Dec 12 2020 - 15:04:21 EST
On Sat, Dec 12, 2020 at 11:07 AM Yuchung Cheng <ycheng@xxxxxxxxxx> wrote:
>
> On Sat, Dec 12, 2020 at 11:01 AM Alexander Duyck
> <alexander.duyck@xxxxxxxxx> wrote:
> >
> > On Sat, Dec 12, 2020 at 10:34 AM Yuchung Cheng <ycheng@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Dec 11, 2020 at 5:28 PM Alexander Duyck
> > > <alexander.duyck@xxxxxxxxx> wrote:
> > > >
> > > > From: Alexander Duyck <alexanderduyck@xxxxxx>
> > > >
> > > > There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
> > > > message in the case of IPv6 or a fragmentation request in the case of
> > > > IPv4. This results in the socket stalling for a second or more as it does
> > > > not respond to the message by retransmitting the SYN frame.
> > > >
> > > > Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
> > > > ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
> > > > makes use of the entire MSS. In the case of fastopen it does, and an
> > > > additional complication is that the retransmit queue doesn't contain the
> > > > original frames. As a result when tcp_simple_retransmit is called and
> > > > walks the list of frames in the queue it may not mark the frames as lost
> > > > because both the SYN and the data packet each individually are smaller than
> > > > the MSS size after the adjustment. This results in the socket being stalled
> > > > until the retransmit timer kicks in and forces the SYN frame out again
> > > > without the data attached.
> > > >
> > > > In order to resolve this we can generate our best estimate for the original
> > > > packet size by detecting the fastopen SYN frame and then adding the
> > > > overhead for MAX_TCP_OPTION_SPACE and verifying if the SYN w/ data would
> > > > have exceeded the MSS. If so we can mark the frame as lost and retransmit
> > > > it.
> > > >
> > > > Signed-off-by: Alexander Duyck <alexanderduyck@xxxxxx>
> > > > ---
> > > > net/ipv4/tcp_input.c | 30 +++++++++++++++++++++++++++---
> > > > 1 file changed, 27 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > > index 9e8a6c1aa019..79375b58de84 100644
> > > > --- a/net/ipv4/tcp_input.c
> > > > +++ b/net/ipv4/tcp_input.c
> > > > @@ -2686,11 +2686,35 @@ static void tcp_mtup_probe_success(struct sock *sk)
> > > > void tcp_simple_retransmit(struct sock *sk)
> > > > {
> > > > const struct inet_connection_sock *icsk = inet_csk(sk);
> > > > + struct sk_buff *skb = tcp_rtx_queue_head(sk);
> > > > struct tcp_sock *tp = tcp_sk(sk);
> > > > - struct sk_buff *skb;
> > > > - unsigned int mss = tcp_current_mss(sk);
> > > > + unsigned int mss;
> > > > +
> > > > + /* A fastopen SYN request is stored as two separate packets within
> > > > + * the retransmit queue, this is done by tcp_send_syn_data().
> > > > + * As a result simply checking the MSS of the frames in the queue
> > > > + * will not work for the SYN packet. So instead we must make a best
> > > > + * effort attempt by validating the data frame with the mss size
> > > > + * that would be computed now by tcp_send_syn_data and comparing
> > > > + * that against the data frame that would have been included with
> > > > + * the SYN.
> > > > + */
> > > > + if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN && tp->syn_data) {
> > > > + struct sk_buff *syn_data = skb_rb_next(skb);
> > > > +
> > > > + mss = tcp_mtu_to_mss(sk, icsk->icsk_pmtu_cookie) +
> > > > + tp->tcp_header_len - sizeof(struct tcphdr) -
> > > > + MAX_TCP_OPTION_SPACE;
> > > nice comment! The original syn_data mss needs to be inferred which is
> > > a hassle to get right. my sense is path-mtu issue is enough to warrant
> > > they are lost.
> > > I suggest simply mark syn & its data lost if tcp_simple_retransmit is
> > > called during TFO handshake, i.e.
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 62f7aabc7920..7f0c4f2947eb 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -2864,7 +2864,8 @@ void tcp_simple_retransmit(struct sock *sk)
> > > unsigned int mss = tcp_current_mss(sk);
> > >
> > > skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
> > > - if (tcp_skb_seglen(skb) > mss)
> > > + if (tcp_skb_seglen(skb) > mss ||
> > > + (tp->syn_data && sk->sk_state == TCP_SYN_SENT))
> > > tcp_mark_skb_lost(sk, skb);
> > > }
> > >
> > > We have a TFO packetdrill test that verifies my suggested fix should
> > > trigger an immediate retransmit vs 1s wait.
> >
> > Okay, I will go that route, although I will still probably make one
> > minor cleanup. Instead of testing for syn_data and state per packet I
> > will probably keep the bit where I overwrite mss since it is only used
> > in the loop. What I can do is switch it from unsigned int to int since
> > technically tcp_current_mss and tcp_skb_seglen are both a signed int
> > anyway. Then I can just set mss to -1 in the syn_data && TCP_SYN_SENT
> > case. That way all of the frames in the ring should fail the check
> > while only having to add one initial check outside the loop.
> sounds good. I thought about that too but decided it's normally just
> two skbs and not worth the hassle.
>
> btw I thought about negatively caching this event to disable TFO
> completely afterward, but I assume it's not needed as the ICMP error
> should correct the route mtu hence future TFO should be fine. If
> that's not the case we might want to consider disable TFO to avoid
> this future "TFO->ICMP->SYN-retry" repetition.
At least in my testing there are no further occurrences after the
first. Basically the first one hits and then all the follow-on
requests use the correct MSS based on the cached path MTU. As such
there isn't any penalty to leaving the feature enabled as over the
course of several transactions it still provides more benefit than
harm.