Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode

From: Kuniyuki Iwashima

Date: Sun May 17 2026 - 15:06:11 EST

On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@xxxxxxxxxx> wrote:
>
> Once a socket enters repair mode (TCP_REPAIR socket option with
> TCP_REPAIR_ON value), it's possible to dump the receive sequence
> number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> (using TCP_REPAIR_QUEUE to select it).
>
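
For readers unfamiliar with the interface, the dump step described
above could look roughly like the sketch below. It is only an
illustration: sketch_dump_rcv_seq() and the SKETCH_* names are
hypothetical, the numeric fallbacks are assumed to mirror
include/uapi/linux/tcp.h, and it is not taken from passt or CRIU.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers; values mirror include/uapi/linux/tcp.h. */
#ifndef TCP_REPAIR
#define TCP_REPAIR		19
#endif
#ifndef TCP_REPAIR_QUEUE
#define TCP_REPAIR_QUEUE	20
#endif
#ifndef TCP_QUEUE_SEQ
#define TCP_QUEUE_SEQ		21
#endif

#define SKETCH_REPAIR_ON	1	/* same value as TCP_REPAIR_ON */
#define SKETCH_RECV_QUEUE	1	/* same value as TCP_RECV_QUEUE */

/*
 * Hypothetical helper: switch the connected TCP socket @s to repair mode,
 * select the receive queue and read its sequence number into @seq.
 * Enabling repair mode needs CAP_NET_ADMIN. The socket is left in repair
 * mode, also on a partial failure; the caller decides when to leave it.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int sketch_dump_rcv_seq(int s, uint32_t *seq)
{
	int on = SKETCH_REPAIR_ON, queue = SKETCH_RECV_QUEUE;
	socklen_t len = sizeof(*seq);

	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)))
		return -1;
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue)))
		return -1;
	if (getsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, seq, &len))
		return -1;

	return 0;
}
```

Nothing in these three calls stops the kernel from accepting and
acknowledging more data once the value has been read, which is the
window the rest of the commit message is about.
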
> If we receive data after the application fetched the sequence number
> or saved the contents of the queue, though, the application will now
> have outdated information, which defeats the whole functionality,
> because this leads to gaps in sequence and data once they're restored
> by the target instance of the application, resulting in a hanging or
> otherwise non-functional TCP connection.
>
> This type of race condition was discovered in the KubeVirt integration
> of passt(1), using a remote iperf3 client connected to an iperf3
> server running in the guest which is being migrated. The setup allows
> traffic to reach the origin node hosting the guest during the
> migration.
>
> If passt dumps sequence number and contents of the queue *before*
> further data is received and acknowledged to the peer by the kernel,
> once the TCP data connection is migrated to the target node, the
> remote client becomes unable to continue sending, because a portion
> of the data it sent *and received an acknowledgement for* is now lost.
>
> Schematically:
>
> 1. client --seq 1:100--> origin host --> passt --> guest --> server
>
> 2. client <--ACK: 100-- origin host
>
> 3. migration starts,

At this point, a netfilter rule or BPF program must be installed to
drop incoming packets until the migration completes.

We do not want a check for such an unlikely condition in the fast path.

You can find a similar issue:
https://lore.kernel.org/netdev/20260130145122.368748-1-me@linux.beauty/

> passt enables repair mode, dumps the sequence
> number (101) and sends it to the target node of the guest migration
>
> 4. client --seq 101:201--> origin host (passt not receiving anymore)
>
> 5. client <--ACK: 201-- origin host
>
> 6. migration completes, and passt restores sequence number 101 on the
> migrated socket
>
> 7. client --seq 201:301--> target host (now seeing a sequence jump)
>
> 8. client <--ACK: 100-- target host
>
> ...and the connection can't recover anymore, because the client can't
> resend data that was already (erroneously) acknowledged. We need to
> avoid step 5. above.
>
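
The sequence arithmetic behind the schematic can be replayed with a
small user-space model. This is a sketch only: struct toy_rx and the
toy_* functions are hypothetical, segments are half-open ranges
[seq, end_seq) starting at 1, and sequence wrap-around is ignored, so
the numbers differ slightly from the ones quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model of a TCP receiver; this is not kernel code. */
struct toy_rx {
	uint32_t rcv_nxt;	/* next sequence number expected */
	bool repair;		/* socket is in repair mode */
	bool drop_in_repair;	/* model dropping data while in repair mode */
};

/*
 * Deliver the in-order segment [seq, end_seq) and return the highest
 * acknowledgement number the client has seen afterwards.
 */
static uint32_t toy_rx_segment(struct toy_rx *rx, uint32_t seq,
			       uint32_t end_seq)
{
	if (rx->repair && rx->drop_in_repair)
		return rx->rcv_nxt;	/* dropped: nothing new is acked */

	if (seq == rx->rcv_nxt)
		rx->rcv_nxt = end_seq;	/* accepted and acknowledged */

	return rx->rcv_nxt;
}

/*
 * Replay steps 1 to 5 and return how many bytes the origin acknowledged
 * that the target, restored from the dumped value, knows nothing about.
 */
static uint32_t toy_migration_gap(bool drop_in_repair)
{
	struct toy_rx origin = { .rcv_nxt = 1,
				 .drop_in_repair = drop_in_repair };
	uint32_t dumped, client_acked;

	client_acked = toy_rx_segment(&origin, 1, 101);	/* steps 1, 2 */
	origin.repair = true;					/* step 3 */
	dumped = origin.rcv_nxt;				/* value sent to target */
	client_acked = toy_rx_segment(&origin, 101, 201);	/* steps 4, 5 */

	return client_acked - dumped;
}
```

In this model, toy_migration_gap(false) returns 100: the bytes that
were acknowledged after the dump and that the client will never
resend. toy_migration_gap(true) returns 0, whether the data is dropped
by the socket itself or by a filter in front of it.
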
> This would equally affect CRIU (the other known user of TCP_REPAIR),
> should data be received while the original container is frozen: the
> sequence dumped and the contents of the saved incoming queue would
> then depend on the timing.
>
> The race condition is also illustrated in the kselftests introduced
> by the next patch.
>
> To prevent this issue, discard data received for a socket in repair
> mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR.
>
> Fixes: ee9952831cfd ("tcp: Initial repair mode")
> Tested-by: Laurent Vivier <lvivier@xxxxxxxxxx>
> Signed-off-by: Stefano Brivio <sbrivio@xxxxxxxxxx>
> ---
>  include/net/dropreason-core.h |  3 +++
>  net/ipv4/tcp_input.c          | 14 +++++++++++++-
>  2 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> index 2f312d1f67d6..19ab9e6ffc33 100644
> --- a/include/net/dropreason-core.h
> +++ b/include/net/dropreason-core.h
> @@ -9,6 +9,7 @@
>  	FN(SOCKET_CLOSE)		\
>  	FN(SOCKET_FILTER)		\
>  	FN(SOCKET_RCVBUFF)		\
> +	FN(SOCKET_REPAIR)		\
>  	FN(UNIX_DISCONNECT)		\
>  	FN(UNIX_SKIP_OOB)		\
>  	FN(PKT_TOO_SMALL)		\
> @@ -158,6 +159,8 @@ enum skb_drop_reason {
>  	SKB_DROP_REASON_SOCKET_FILTER,
>  	/** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full */
>  	SKB_DROP_REASON_SOCKET_RCVBUFF,
> +	/** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */
> +	SKB_DROP_REASON_SOCKET_REPAIR,
>  	/**
>  	 * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when SOCK_DGRAM
>  	 * or SOCK_SEQPACKET socket re-connect()s to another socket or notices
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d5c9e65d9760..6eca34274f97 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>   *	  or pure receivers (this means either the sequence number or the ack
>   *	  value must stay constant)
>   *	- Unexpected TCP option.
> + *	- Socket is in repair mode.
>   *
>   *	When these conditions are not satisfied it drops into a standard
>   *	receive procedure patterned after RFC793 to handle all cases.
> @@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>
>  	if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
>  	    TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
> -	    !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> +	    !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) &&
> +	    !tp->repair) {
>  		int tcp_header_len = tp->tcp_header_len;
>  		s32 delta = 0;
>  		int flag = 0;
> @@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>  		goto discard;
>  	}
>
> +	if (tp->repair) {
> +		reason = SKB_DROP_REASON_SOCKET_REPAIR;
> +		goto discard;
> +	}
> +
>  	/*
>  	 *	Standard slow path.
>  	 */
> @@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>  	int queued = 0;
>  	SKB_DR(reason);
>
> +	if (tp->repair) {
> +		SKB_DR_SET(reason, SOCKET_REPAIR);
> +		goto discard;
> +	}
> +
>  	switch (sk->sk_state) {
>  	case TCP_CLOSE:
>  		SKB_DR_SET(reason, TCP_CLOSE);
> --
> 2.43.0
>