RE: [PATCH] tcp: check socket state before calling WARN_ON
From: Dujeong.lee
Date: Wed Dec 04 2024 - 03:01:35 EST
On Wed, Dec 4, 2024 at 4:14 PM Eric Dumazet wrote:
> To: Youngmin Nam <youngmin.nam@xxxxxxxxxxx>
> Cc: Jakub Kicinski <kuba@xxxxxxxxxx>; Neal Cardwell <ncardwell@xxxxxxxxxx>;
> davem@xxxxxxxxxxxxx; dsahern@xxxxxxxxxx; pabeni@xxxxxxxxxx;
> horms@xxxxxxxxxx; dujeong.lee@xxxxxxxxxxx; guo88.liu@xxxxxxxxxxx;
> yiwang.cai@xxxxxxxxxxx; netdev@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; joonki.min@xxxxxxxxxxx; hajun.sung@xxxxxxxxxxx;
> d7271.choe@xxxxxxxxxxx; sw.ju@xxxxxxxxxxx
> Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON
>
> On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@xxxxxxxxxxx>
> wrote:
> >
> > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > I have not seen these warnings firing. Neal, have you seen this in
> the past ?
> > > >
> > > > I can't recall seeing these warnings over the past 5 years or so,
> > > > and (from checking our monitoring) they don't seem to be firing in
> > > > our fleet recently.
> > >
> > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > Could be that one of our workloads is pinned to 5.12.
> > > Youngmin, what's the newest kernel you can repro this on?
> > >
> > Hi Jakub.
> > Thank you for taking an interest in this issue.
> >
> > We've seen this issue since 5.15 kernel.
> > Now, we can see this on 6.6 kernel which is the newest kernel we are
> running.
>
> The fact that we are processing ACK packets after the write queue has been
> purged would be a serious bug.
>
> Thus the WARN() makes sense to us.
>
> It would be easy to build a packetdrill test. Please do so, then we can
> fix the root cause.
>
> Thank you !
Please let me share some more details and clarifications on the issue from ramdump snapshot locally secured.
1) This issue has been reported from Android-T linux kernel when we enabled panic_on_warn for the first time.
Reproduction rate is not high and can be seen in any test cases with public internet connection.
2) Analysis from ramdump (which is not available at the moment).
2-A) From ramdump, I was able to find below values.
tp->packets_out = 0
tp->retrans_out = 1
tp->max_packets_out = 1
tp->max_packets_Seq = 1575830358
tp->snd_ssthresh = 5
tp->snd_cwnd = 1
tp->prior_cwnd = 10
tp->wite_seq = 1575830359
tp->pushed_seq = 1575830358
tp->lost_out = 1
tp->sacked_out = 0
2-B) The last Tx packet from the device is (Time: 17371.562934)
Hex:
4500005b95a3400040063e34c0a848188efacf0a888a01bb5ded432f5ad8ab29801800495b5800000101080a3a52197fef299d901703030022f3589123b0523bdd07be137a98ca9b5d3475332d4382c7b420571e6d437a07ba7787
Internet Protocol Version 4
0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
Total Length: 91
Identification: 0x95a3 (38307)
010. .... = Flags: 0x2, Don't fragment
...0 0000 0000 0000 = Fragment Offset: 0
Time to Live: 64
Protocol: TCP (6)
Header Checksum: 0x3e34
Header checksum status: Unverified
Source Address: 192.168.72.24
Destination Address: 142.250.207.10
Stream index: 0
Transmission Control Protocol
Source Port: 34954
Destination Port: 443
Stream index: 0
Conversation completeness: Incomplete (0)
TCP Segment Len: 39
Sequence Number: 0x5ded432f
Sequence Number (raw): 1575830319
Next Sequence Number: 40
Acknowledgment Number: 0x5ad8ab29
Acknowledgment number (raw): 1524149033
1000 .... = Header Length: 32 bytes (8)
Flags: 0x018 (PSH, ACK)
Window: 73
Calculated window size: 73
Window size scaling factor: -1 (unknown)
Checksum: 0x5b58
Checksum Status: Unverified
Urgent Pointer: 0
Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
Timestamps
SEQ/ACK analysis
TCP payload (39 bytes)
Transport Layer Security
TLSv1.2 Record Layer: Application Data Protocol: Hypertext Transfer Protocol
2-C) When warn hit, DUT was processing (Time: 17399.502603, 28 seconds later since last Tx)
Hex:
456000405FA20000720681F08EFACF0AC0A8481801BB888A5AD8AB295DED4356B010010D93D800000101080AEF299EF43A52089F0101050A5DED432F5DED4356
Internet Protocol Version 4
0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x60 (DSCP: CS3, ECN: Not-ECT)
Total Length: 64
Identification: 0x5fa2 (24482)
000. .... = Flags: 0x0
...0 0000 0000 0000 = Fragment Offset: 0
Time to Live: 114
Protocol: TCP (6)
Header Checksum: 0x81f0
Header checksum status: Unverified
Source Address: 142.250.207.10
Destination Address: 192.168.72.24
Stream index: 0
Transmission Control Protocol
Source Port: 443
Destination Port: 34954
Stream index: 0
Conversation completeness: Incomplete (0)
TCP Segment Len: 0
Sequence Number: 0x5ad8ab29
Sequence Number (raw): 1524149033
Next Sequence Number: 1
Acknowledgment Number: 0x5ded4356
Acknowledgment number (raw): 1575830358
1011 .... = Header Length: 44 bytes (11)
Flags: 0x010 (ACK)
Window: 269
Calculated window size: 269
Window size scaling factor: -1 (unknown)
Checksum: 0x93d8
Checksum Status: Unverified
Urgent Pointer: 0
Options: (24 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps, No-Operation (NOP), No-Operation (NOP), SACK
Timestamps
2-D) The DUT received ack after 28 seconds from Access Point.
3)Clarification on "tcp_write_queue_purge" claim
This is just my conjecture based on ramdump snapshot and it is not shown in calltrace.
Based on tcp status in snapshot I thought tcp_write_queue_purge was called and packets_out was cleared.
4) In our kernel "/proc/sys/net/ipv4/tcp_mtu_probing" is set to 0.