Re: [PATCH] tcp: check socket state before calling WARN_ON

From: Neal Cardwell
Date: Wed Dec 04 2024 - 09:22:18 EST


On Wed, Dec 4, 2024 at 2:48 AM Dujeong.lee <dujeong.lee@xxxxxxxxxxx> wrote:
>
> On Wed, Dec 4, 2024 at 4:14 PM Eric Dumazet wrote:
> > To: Youngmin Nam <youngmin.nam@xxxxxxxxxxx>
> > Cc: Jakub Kicinski <kuba@xxxxxxxxxx>; Neal Cardwell <ncardwell@xxxxxxxxxx>;
> > davem@xxxxxxxxxxxxx; dsahern@xxxxxxxxxx; pabeni@xxxxxxxxxx;
> > horms@xxxxxxxxxx; dujeong.lee@xxxxxxxxxxx; guo88.liu@xxxxxxxxxxx;
> > yiwang.cai@xxxxxxxxxxx; netdev@xxxxxxxxxxxxxxx; linux-
> > kernel@xxxxxxxxxxxxxxx; joonki.min@xxxxxxxxxxx; hajun.sung@xxxxxxxxxxx;
> > d7271.choe@xxxxxxxxxxx; sw.ju@xxxxxxxxxxx
> > Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON
> >
> > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@xxxxxxxxxxx>
> > wrote:
> > >
> > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > > I have not seen these warnings firing. Neal, have you seen this in
> > > > > > > the past?
> > > > >
> > > > > I can't recall seeing these warnings over the past 5 years or so,
> > > > > and (from checking our monitoring) they don't seem to be firing in
> > > > > our fleet recently.
> > > >
> > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > Could be that one of our workloads is pinned to 5.12.
> > > > Youngmin, what's the newest kernel you can repro this on?
> > > >
> > > Hi Jakub.
> > > Thank you for taking an interest in this issue.
> > >
> > > We've seen this issue since 5.15 kernel.
> > > > Now, we can see this on 6.6 kernel which is the newest kernel we are
> > > > running.
> >
> > The fact that we are processing ACK packets after the write queue has been
> > purged would be a serious bug.
> >
> > Thus the WARN() makes sense to us.
> >
> > It would be easy to build a packetdrill test. Please do so, then we can
> > fix the root cause.
> >
> > Thank you !
>
>
> Let me share some more details and clarifications on the issue, based on a ramdump snapshot we captured locally.
>
> 1) This issue was first reported on the Android-T Linux kernel when we enabled panic_on_warn for the first time.
> The reproduction rate is not high, and the issue can occur in any test case that uses a public internet connection.
>
> 2) Analysis from ramdump (which is not available at the moment).
> 2-A) From ramdump, I was able to find below values.
> tp->packets_out = 0
> tp->retrans_out = 1
> tp->max_packets_out = 1
> tp->max_packets_seq = 1575830358
> tp->snd_ssthresh = 5
> tp->snd_cwnd = 1
> tp->prior_cwnd = 10
> tp->write_seq = 1575830359
> tp->pushed_seq = 1575830358
> tp->lost_out = 1
> tp->sacked_out = 0

Thanks for all the details! If the ramdump becomes available again at
some point, would it be possible to pull out the following values as
well:

tp->mss_cache
inet_csk(sk)->icsk_pmtu_cookie
inet_csk(sk)->icsk_ca_state
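
For anyone following along, here is a rough userspace sketch of why the
counters quoted above look inconsistent. It assumes the relevant check is
the left_out consistency test (sacked_out + lost_out must not exceed
packets_out); the excerpt above does not say which WARN_ON actually fired.
The struct and helper names are hypothetical stand-ins, not the kernel's
own definitions.

```c
/* Hypothetical stand-in for the few struct tcp_sock counters quoted above. */
struct tp_counters {
	unsigned int packets_out;
	unsigned int retrans_out;
	unsigned int lost_out;
	unsigned int sacked_out;
};

/* Packets considered to have left the network: SACKed plus marked lost. */
static unsigned int left_out(const struct tp_counters *tp)
{
	return tp->sacked_out + tp->lost_out;
}

/* In-flight estimate: packets_out - left_out + retrans_out. */
static unsigned int packets_in_flight(const struct tp_counters *tp)
{
	return tp->packets_out - left_out(tp) + tp->retrans_out;
}

/* Nonzero when left_out does not exceed packets_out (the assumed invariant). */
static int left_out_consistent(const struct tp_counters *tp)
{
	return left_out(tp) <= tp->packets_out;
}
```

With the reported values (packets_out = 0, retrans_out = 1, lost_out = 1,
sacked_out = 0), left_out() is 1 while packets_out is 0, so
left_out_consistent() returns 0.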

Thanks,
neal