RE: [PATCH] tcp: check socket state before calling WARN_ON

From: Dujeong.lee
Date: Thu Dec 05 2024 - 07:31:25 EST


On Wed, Dec 4, 2024 at 11:22 PM Neal Cardwell <ncardwell@xxxxxxxxxx> wrote:
> On Wed, Dec 4, 2024 at 2:48 AM Dujeong.lee <dujeong.lee@xxxxxxxxxxx> wrote:
> > On Wed, Dec 4, 2024 at 4:14 PM Eric Dumazet wrote:
> > > To: Youngmin Nam <youngmin.nam@xxxxxxxxxxx>
> > > Cc: Jakub Kicinski <kuba@xxxxxxxxxx>; Neal Cardwell
> > > <ncardwell@xxxxxxxxxx>; davem@xxxxxxxxxxxxx; dsahern@xxxxxxxxxx;
> > > pabeni@xxxxxxxxxx; horms@xxxxxxxxxx; dujeong.lee@xxxxxxxxxxx;
> > > guo88.liu@xxxxxxxxxxx; yiwang.cai@xxxxxxxxxxx;
> > > netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > > joonki.min@xxxxxxxxxxx; hajun.sung@xxxxxxxxxxx;
> > > d7271.choe@xxxxxxxxxxx; sw.ju@xxxxxxxxxxx
> > > Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON
> > >
> > > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam
> > > <youngmin.nam@xxxxxxxxxxx>
> > > wrote:
> > > >
> > > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > > > I have not seen these warnings firing. Neal, have you seen
> > > > > > > > this in the past?
> > > > > >
> > > > > > I can't recall seeing these warnings over the past 5 years or
> > > > > > so, and (from checking our monitoring) they don't seem to be
> > > > > > firing in our fleet recently.
> > > > >
> > > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > > Could be that one of our workloads is pinned to 5.12.
> > > > > Youngmin, what's the newest kernel you can repro this on?
> > > > >
> > > > Hi Jakub.
> > > > Thank you for taking an interest in this issue.
> > > >
> > > > > We've seen this issue since the 5.15 kernel.
> > > > > We can still see it on the 6.6 kernel, which is the newest kernel
> > > > > we are running.
> > >
> > > The fact that we are processing ACK packets after the write queue
> > > has been purged would be a serious bug.
> > >
> > > Thus the WARN() makes sense to us.
> > >
> > > It would be easy to build a packetdrill test. Please do so, then we
> > > can fix the root cause.
> > >
> > > Thank you !
> >
> >
> > Please let me share some more details and clarifications on the issue
> > from the ramdump snapshot we secured locally.
> >
> > 1) This issue was reported on the Android-T Linux kernel when we
> > enabled panic_on_warn for the first time.
> > The reproduction rate is not high, and the issue can be seen in any
> > test case with a public internet connection.
> >
> > 2) Analysis from ramdump (which is not available at the moment).
> > 2-A) From the ramdump, I was able to find the values below.
> > tp->packets_out = 0
> > tp->retrans_out = 1
> > tp->max_packets_out = 1
> > tp->max_packets_seq = 1575830358
> > tp->snd_ssthresh = 5
> > tp->snd_cwnd = 1
> > tp->prior_cwnd = 10
> > tp->write_seq = 1575830359
> > tp->pushed_seq = 1575830358
> > tp->lost_out = 1
> > tp->sacked_out = 0
>
> Thanks for all the details! If the ramdump becomes available again at some
> point, would it be possible to pull out the following values as
> well:
>
> tp->mss_cache
> inet_csk(sk)->icsk_pmtu_cookie
> inet_csk(sk)->icsk_ca_state
>
> Thanks,
> neal

Okay, I will check the values below once the ramdump is secured.
- tp->mss_cache
- inet_csk(sk)->icsk_pmtu_cookie
- inet_csk(sk)->icsk_ca_state
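
As a reading aid for the counters quoted above, here is a minimal
user-space sketch that plugs them into the packet accounting arithmetic.
It assumes the helpers in include/net/tcp.h have the usual form
(tcp_left_out() = sacked_out + lost_out, tcp_packets_in_flight() =
packets_out - left_out + retrans_out, and tcp_verify_left_out() warning
when left_out exceeds packets_out). The struct and function names below
are hypothetical stand-ins, not kernel code, and the sketch does not
claim to identify which WARN_ON actually fired.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical user-space stand-in for the four struct tcp_sock counters
 * quoted above; this is not the kernel structure itself. */
struct tp_snapshot {
	uint32_t packets_out;
	uint32_t retrans_out;
	uint32_t lost_out;
	uint32_t sacked_out;
};

/* Values copied from the ramdump listing in this thread. */
static const struct tp_snapshot ramdump_snapshot = {
	.packets_out = 0,
	.retrans_out = 1,
	.lost_out    = 1,
	.sacked_out  = 0,
};

/* Assumed to mirror tcp_left_out(): segments SACKed or marked lost. */
static inline uint32_t left_out(const struct tp_snapshot *tp)
{
	return tp->sacked_out + tp->lost_out;
}

/* Assumed to mirror tcp_packets_in_flight(); unsigned arithmetic wraps. */
static inline uint32_t packets_in_flight(const struct tp_snapshot *tp)
{
	return tp->packets_out - left_out(tp) + tp->retrans_out;
}

/* Assumed condition behind tcp_verify_left_out(): 1 means inconsistent. */
static inline int left_out_exceeds_packets_out(const struct tp_snapshot *tp)
{
	return left_out(tp) > tp->packets_out;
}

/* Print the derived values for one snapshot. */
static inline void print_snapshot_check(const struct tp_snapshot *tp)
{
	assert(tp != NULL);
	printf("left_out=%u packets_out=%u in_flight=%d inconsistent=%d\n",
	       (unsigned)left_out(tp), (unsigned)tp->packets_out,
	       (int)packets_in_flight(tp), left_out_exceeds_packets_out(tp));
}
```

Calling print_snapshot_check(&ramdump_snapshot) from a small main() gives
left_out = 1 against packets_out = 0, so under these assumed formulas the
snapshot is inconsistent (lost_out and retrans_out still count one segment
although nothing is accounted as outstanding), while the in-flight
estimate itself comes out as 0.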

We are now running tests with the latest kernel.

Thanks
Dujeong.