TCP code (possible?) problem in 2.0.30 (long)

Doug Ledford (dledford@dialnet.net)
Thu, 01 May 1997 06:00:14 -0500


The other day I was running a program that is known to chew up system
resources; among other things, it will almost fill the process table with
processes halted in a disk wait. Some of the child processes ended up
seg faulting on me with general protection faults. I didn't think too much
about it since the machine kept running, but I did note that this happened
after the machine actually managed to fill the process table. That was a
day or two ago. Tonight I noticed on the same machine that netstat no
longer works: it segfaults every time I try to run it. So I went into the
/proc/net directory and started poking around, and found that even a simple
cat of the file tcp causes cat to seg fault with a general protection
fault. To summarize: this started after the process table filled up,
sendmail was the first thing to segfault, and now everything that tries to
read /proc/net/tcp seg faults and produces the following info:

general protection: 0000
CPU: 0
EIP: 0010:[get__netinfo+334/684]
EFLAGS: 00010202
eax: 61657262 ebx: 06dd4810 ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000080 ebp: 00000000 esp: 04e65e78
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process cat (pid: 6684, process nr: 52, stackpage=04e65000)
Stack: 043ed000 001a1d3c 07c49380 00001000 ffffffff 00000000 04e65eb0 00003280
00000063 00000050 ffff0650 e5f941ce 4b37a6c2 00000000 38392020 3545203a
31343946 303a4543 20303530 33433141 37434541 3634303a 37302044 30303020
Call Trace: [write_chan+247/400] [tcp_get_info+33/40] [proc_readnet+173/324]
[sys_read+179/216] [system_call+85/128]
Code: 8b 40 04 89 44 24 14 8b 54 24 14 52 31 c0 85 f6 74 06 8b 83
Aiee, killing interrupt handler

Code: 1410aa <get__netinfo+14e/2ac> movl 0x4(%eax),%eax
Code: 1410ad <get__netinfo+151/2ac> movl %eax,0x14(%esp,1)
Code: 1410b1 <get__netinfo+155/2ac> movl 0x14(%esp,1),%edx
Code: 1410b5 <get__netinfo+159/2ac> pushl %edx
Code: 1410b6 <get__netinfo+15a/2ac> xorl %eax,%eax
Code: 1410b8 <get__netinfo+15c/2ac> testl %esi,%esi
Code: 1410ba <get__netinfo+15e/2ac> je 1410c2 <get__netinfo+166/2ac>
Code: 1410bc <get__netinfo+160/2ac> movl 0x90909000(%ebx),%eax
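
A note for anyone matching the EIP line against the disassembly above: the
oops prints the symbol offset and size in decimal (get__netinfo+334/684),
while the disassembly labels use hex (get__netinfo+14e/2ac). A minimal
sketch of that conversion follows; the helper name to_hex_offset is only
illustrative and is not part of any kernel tool.

```python
import re

# Matches entries such as "[get__netinfo+334/684]" (brackets optional).
SYMBOL = re.compile(r"\[?(\w+)\+(\d+)/(\d+)\]?")


def to_hex_offset(entry):
    """Convert a 'symbol+offset/size' entry from decimal to hex notation."""
    m = SYMBOL.fullmatch(entry.strip())
    if m is None:
        raise ValueError("not a symbol+offset/size entry: %r" % entry)
    name, offset, size = m.group(1), int(m.group(2)), int(m.group(3))
    return "%s+%x/%x" % (name, offset, size)


print(to_hex_offset("[get__netinfo+334/684]"))        # get__netinfo+14e/2ac
print(to_hex_offset("[tcp_do_retransmit+856/1376]"))  # tcp_do_retransmit+358/560
```

So the faulting instruction is the first one listed, the movl at
get__netinfo+14e.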

That is what I'm getting now, while everything else on the machine seems to
be running all right. However, the initial seg fault looked different:

general protection: 0000
CPU: 0
EIP: 0010:[tcp_do_retransmit+856/1376]
EFLAGS: 00010217
eax: 0000000e ebx: 07a7df67 ecx: 00000003 edx: 00000003
esi: 07a7df67 edi: fffffff2 ebp: fffffff2 esp: 073bde50
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process sendmail (pid: 13174, process nr: 283, stackpage=073bd000)
Stack: 03157810 00000001 00000001 073bded8 031579d8 f000ef6f fffffffc 00000000
00000004 0740e698 001c0c50 0242d584 0014ba4a 03157810 00000000 03157810
0014babf 03157810 00000000 0014bc77 03157810 00000000 03157810 0014bd0e
Call Trace: [tcp_retransmit_time+22/120] [tcp_retransmit+19/72]
[tcp_time_write_timeout+19/32] [tcp_retransmit_timer+138/224]
[tcp_retransmit_timer+0/224] [timer_bh+749/820] [do_bottom_half+59/96]
[inet_create+0/848] [handle_bottom_half+11/32] [inet_create+0/848]
[get_empty_filp+180/216] [inet_create+0/848] [get_fd+22/124]
[inet_create+0/848] [sys_socket+261/324]
[inet_create+0/848] [sys_socketcall+250/732] [system_call+85/128]
Code: f3 a5 a8 02 74 02 66 a5 a8 01 74 01 a4 8b 4c 24 24 8b 41 20
Aiee, killing interrupt handler

Code: repz movsl %ds:(%esi),%es:(%edi)
Code: testb $0x2,%al
Code: je 00000008 <_EIP+8>
Code: movsw %ds:(%esi),%es:(%edi)
Code: testb $0x1,%al
Code: je 0000000d <_EIP+d>
Code: movsb %ds:(%esi),%es:(%edi)
Code: movl 0x24(%esp,1),%ecx
Code: movl 0x20(%ecx),%eax

Followed by another seg fault from the same program:

general protection: 0000
CPU: 0
EIP: 0010:[locks_remove_locks+12/56]
EFLAGS: 00010286
eax: f000ef6f ebx: 04c06414 ecx: 00000000 edx: 002aa0f4
esi: 00000000 edi: f000ef6f ebp: 01999018 esp: 073bdd94
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process sendmail (pid: 13174, process nr: 283, stackpage=073bd000)
Stack: 00000001 0012224f 04c06414 00000000 00000001 00000005 00000001 00116786
00000000 0000002b 00000014 073be000 073bde14 0010ab33 0000000b 0019bb7a
00000000 07a7df67 fffffff2 fffffff2 00000020 09000000 08800000 00000018
Call Trace: [close_fp+55/92] [do_exit+274/492] [die_if_kernel+695/704]
[<09000000>] [dummy0:dummy_init+-311340/552]
[do_general_protection+40/84] [do_general_protection+0/84]
[error_code+64/80] [tcp_do_retransmit+856/1376]
[tcp_retransmit_time+22/120] [tcp_retransmit+19/72]
[tcp_time_write_timeout+19/32] [tcp_retransmit_timer+138/224]
[tcp_retransmit_timer+0/224] [timer_bh+749/820]
[do_bottom_half+59/96] [inet_create+0/848] [handle_bottom_half+11/32]
[inet_create+0/848] [get_empty_filp+180/216] [inet_create+0/848]
[get_fd+22/124] [inet_create+0/848] [sys_socket+261/324]
[inet_create+0/848] [sys_socketcall+250/732] [system_call+85/128]
Code: 8b 50 50 85 d2 74 22 f6 42 1c 01 74 0f 53 83 c0 50 50 e8 15

Code: movl 0x50(%eax),%edx
Code: testl %edx,%edx
Code: je 00000029 <_EIP+29>
Code: testb $0x1,0x1c(%edx)
Code: je 0000001c <_EIP+1c>
Code: pushl %ebx
Code: addl $0x50,%eax
Code: pushl %eax
Code: call 9090002c <_EIP+9090002c>
Code: nop

Anyway, that's what I could put together out of the log files. I'll let
someone who actually knows about the net code look at it now :)

I should also mention that I have been getting the notorious message:
Apr 28 17:06:31 shell kernel: TCP: **bug**: copy=0, sk->mss=0
Apr 28 17:06:32 shell last message repeated 410 times
Apr 28 17:06:32 shell kernel: TCP: **bug**: copy=0, sk->mss=0
Apr 28 17:06:32 shell last message repeated 114 times
Apr 28 17:06:32 shell kernel: TCP: **bug**: copy=0, sk->mss=0
Apr 28 17:06:32 shell last message repeated 202 times
Apr 28 17:06:32 shell kernel: TCP: **bug**: copy=0, sk->mss=0
Apr 28 17:06:33 shell last message repeated 401 times
Apr 28 17:06:33 shell kernel: TCP: **bug**: copy=0, sk->mss=0
Apr 28 17:06:33 shell last message repeated 234 times

This is from the same machine, a day before the original seg fault above.
Do other people see this bug message in that kind of quantity (there are
1366 of them within a few seconds)?
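
For reference, the 1366 figure is just each logged line plus the syslog
repeat counts. A small sketch of that tally is below; the function name
count_messages is made up for illustration, and it assumes every "last
message repeated" line refers to the TCP bug message, which holds for the
excerpt above.

```python
import re

LOG = """\
Apr 28 17:06:31 shell kernel: TCP: **bug**: copy=0, sk->mss=0
Apr 28 17:06:32 shell last message repeated 410 times
Apr 28 17:06:32 shell kernel: TCP: **bug**: copy=0, sk->mss=0
Apr 28 17:06:32 shell last message repeated 114 times
Apr 28 17:06:32 shell kernel: TCP: **bug**: copy=0, sk->mss=0
Apr 28 17:06:32 shell last message repeated 202 times
Apr 28 17:06:32 shell kernel: TCP: **bug**: copy=0, sk->mss=0
Apr 28 17:06:33 shell last message repeated 401 times
Apr 28 17:06:33 shell kernel: TCP: **bug**: copy=0, sk->mss=0
Apr 28 17:06:33 shell last message repeated 234 times
"""

REPEAT = re.compile(r"last message repeated (\d+) times")


def count_messages(log_text, needle="TCP: **bug**"):
    """Count occurrences of needle, including syslog repeat summaries.

    Assumption: each repeat summary refers to the needle message.
    """
    total = 0
    for line in log_text.splitlines():
        m = REPEAT.search(line)
        if m:
            total += int(m.group(1))   # collapsed repeats
        elif needle in line:
            total += 1                 # the explicitly logged line
    return total


print(count_messages(LOG))  # 1366
```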

And for the curious, here's the output of procinfo:
Linux 2.0.30 (root@shell) (gcc 2.7.2) #2 Sun Apr 6 23:51:32 CDT 1997 [shell]

Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:        127984      126960        1024       23076        2608       68728
Swap:       133112         120      132992

Bootup: Mon Apr 07 03:58:05 1997 Load average: 1.24 1.19 1.18 3/64 8510

user : 2d 10:13:13.76 10.1% page in : 27321187 disk 1: 2177299r 6725588w
nice : 1d 11:42:34.36 6.2% page out: 33316868 disk 2: 3916181r14134799w
system: 1d 15:19:32.36 6.8% swap in : 1768438
idle : 18d 12:42:07.53 76.9% swap out: 74780
uptime: 24d 1:57:28.00 context : 298504275

irq 0: 208064801 timer irq 8: 0 + rtc
irq 1: 38140 keyboard irq 9: 0
irq 2: 0 cascade irq 10: 26952286 + aic7xxx
irq 3: 0 irq 11: 218480150 21140
irq 4: 0 irq 12: 0
irq 5: 0 irq 13: 1 math error
irq 6: 2 irq 14: 0
irq 7: 0 irq 15: 0

-- 
*****************************************************************************
* Doug Ledford                      *   Unix, Novell, Dos, Windows 3.x,     *
* dledford@dialnet.net    873-DIAL  *     WfW, Windows 95 & NT Technician   *
*   PPP access $14.95/month         *****************************************
*   Springfield, MO and surrounding * Usenet news, e-mail and shell account.*
*   communities.  Sign-up online at * Web page creation and hosting, other  *
*   873-9000 V.34                   * services available, call for info.    *
*****************************************************************************