Re: [PATCH] tcp: check socket state before calling WARN_ON

From: Neal Cardwell
Date: Fri Dec 06 2024 - 10:38:35 EST


On Fri, Dec 6, 2024 at 4:08 AM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> On Fri, Dec 6, 2024 at 9:58 AM Youngmin Nam <youngmin.nam@xxxxxxxxxxx> wrote:
> >
> > On Fri, Dec 06, 2024 at 09:35:32AM +0100, Eric Dumazet wrote:
> > > On Fri, Dec 6, 2024 at 6:50 AM Youngmin Nam <youngmin.nam@xxxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Dec 04, 2024 at 08:13:33AM +0100, Eric Dumazet wrote:
> > > > > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@xxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > > > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> > > > > > > >
> > > > > > > > I can't recall seeing these warnings over the past 5 years or so, and
> > > > > > > > (from checking our monitoring) they don't seem to be firing in our
> > > > > > > > fleet recently.
> > > > > > >
> > > > > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > > > > Could be that one of our workloads is pinned to 5.12.
> > > > > > > Youngmin, what's the newest kernel you can repro this on?
> > > > > > >
> > > > > > Hi Jakub.
> > > > > > Thank you for taking an interest in this issue.
> > > > > >
> > > > > > We've seen this issue since the 5.15 kernel.
> > > > > > Now we can see it on the 6.6 kernel, which is the newest kernel we are running.
> > > > >
> > > > > The fact that we are processing ACK packets after the write queue has
> > > > > been purged would be a serious bug.
> > > > >
> > > > > Thus the WARN() makes sense to us.
> > > > >
> > > > > It would be easy to build a packetdrill test. Please do so, then we
> > > > > can fix the root cause.
> > > > >
> > > > > Thank you !
> > > > >
> > > >
> > > > Hi Eric.
> > > >
> > > > Unfortunately, we are not familiar with the Packetdrill test.
> > > > Referring to the official website on GitHub, I tried to install it on my device.
> > > >
> > > > Here is what I did on my local machine.
> > > >
> > > > $ mkdir packetdrill
> > > > $ cd packetdrill
> > > > $ git clone https://github.com/google/packetdrill.git .
> > > > $ cd gtests/net/packetdrill/
> > > > $ ./configure
> > > > $ make CC=/home/youngmin/Downloads/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
> > > >
> > > > $ adb root
> > > > $ adb push packetdrill /data/
> > > > $ adb shell
> > > >
> > > > And here is what I did on my device
> > > >
> > > > erd9955:/data/packetdrill/gtests/net # ./packetdrill/run_all.py -S -v -L -l tcp/
> > > > /system/bin/sh: ./packetdrill/run_all.py: No such file or directory
> > > >
> > > > I'm not sure if this procedure is correct.
> > > > Could you help us run the Packetdrill on an Android device ?

BTW, Youngmin, do you have a packet trace (e.g., tcpdump .pcap file)
of the workload that causes this warning?

If not, in order to construct a packetdrill test to reproduce this
issue, you may need to:

(1) add code to the warning to print the local and remote IP address
and port number when the warning fires (see DBGUNDO() for an example)

(2) take a tcpdump .pcap trace of the workload

Then you can use the {local_ip:local_port, remote_ip:remote_port} info
from (1) to find the packet trace in (2) that can be used to construct
a packetdrill test to reproduce this issue.
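
For the lookup in that last step, something along the following lines
might work. This is only a rough sketch: flow_filter is a hypothetical
helper name, the addresses are documentation-range examples rather than
real data, and trace.pcap / flow.pcap are placeholder file names.

```shell
# Build a pcap filter expression that matches both directions of one TCP flow.
# Arguments: local_ip local_port remote_ip remote_port, as printed by the
# (modified) warning. flow_filter is a hypothetical helper, not an existing tool.
flow_filter() {
    printf 'tcp and ((src host %s and src port %s and dst host %s and dst port %s) or (src host %s and src port %s and dst host %s and dst port %s))' \
        "$1" "$2" "$3" "$4" "$3" "$4" "$1" "$2"
}

# Example usage (placeholder file names, made-up addresses):
# read the full capture and write only the matching flow to a smaller file.
# tcpdump -n -r trace.pcap -w flow.pcap "$(flow_filter 192.0.2.10 40000 198.51.100.20 443)"
```

Matching on the full 4-tuple in both directions should leave just the one
connection, including the ACKs arriving from the remote side, which is
the part needed to write the packetdrill script.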

thanks,
neal