2.0.3x I found something unique...

Rick Bressler (rick.bressler@boeing.com)
Fri, 25 Sep 1998 12:32:50 -0700 (PDT)


I ran this by Alan Cox, and this was his response:

> Thats like the task fell off the run queues. Ok consider me outweirded
> by that one. Its certainly not one of the standard bug reports I get

I figured that if this was a new one for Alan, I wouldn't be wasting
this group's time with a 'known' problem.

I've seen this problem from at least 2.0.30 up through 2.0.34. I have
not yet been able to install 2.0.35 or 2.0.36pre on this box, as this is
a production system that I get limited outages on. It may be related to
smbfs, or to the eepro or eepro100 drivers, as those are the only
'unique' things about this installation.

Actually, there are two boxes configured the same and I see it on both,
so it is not likely related to defective hardware. (Both boxes are
production systems. Of course I can't duplicate this on any of my
development boxes. :-( )

At any rate, the 'problem' is processes 'stuck' in the running state, but
they are not actually running. These pop up pretty consistently every
2-4 million processes (yes, this is a very busy system) which on this
machine means every couple of weeks or less. They hang around until a
reboot.

USER       PID %CPU %MEM SIZE  RSS TTY STAT START   TIME COMMAND
bressler 15707  0.0  1.3 1032  420  ?  R N  Sep 20  0:00 ksh bin/security.scan

FLAGS UID   PID PPID PRI NI SIZE  RSS WCHAN STA TTY TIME COMMAND
    0 975 15707    1  19 19 1032  420 0     R N  ?  0:00 ksh bin/security.scan

For a process that has been 'running' since Sep 20, it hasn't
accumulated much CPU. :-)

From /proc/15707/stat*:

15707 (security.scan) R 1 23 23 0 -1 0 1295 2827 115 10542 11 20 73 51 19 19 0 0 35252415 1056768 105 2147483647 134512640 134658403 3221224888 3221223860 1074229440 344322 0 2147483654 81921 0 0 0

105 105 77 31 0 74 28

Name: security.scan
State: R (running)
Pid: 15707
PPid: 1
Uid: 975 975 975 975
Gid: 200 200 200 200
VmSize: 1032 kB
VmLck: 0 kB
VmRSS: 420 kB
VmData: 276 kB
VmStk: 8 kB
VmExe: 144 kB
VmLib: 576 kB
SigPnd: 00054102
SigBlk: 00000000
SigIgn: 80000006
SigCgt: 00014001
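
For anyone who wants to read the raw stat line without counting fields by
hand, here is a minimal Python sketch. It assumes the field positions
documented in proc(5) (state is field 3, utime and stime are fields 14 and
15, the pending-signal mask is field 31, wchan is field 35) and a clock
tick rate of HZ = 100; both are assumptions on my part and worth checking
against the 2.0.x source. The name parse_stat is just illustrative.

```python
# Sketch: pull a few fields out of a /proc/<pid>/stat line.
# Assumptions: HZ = 100 ticks per second, and the field positions
# documented in proc(5). Neither is confirmed for this exact kernel.

HZ = 100  # assumed clock tick rate

STAT_LINE = (
    "15707 (security.scan) R 1 23 23 0 -1 0 1295 2827 115 10542 11 20 73 51 "
    "19 19 0 0 35252415 1056768 105 2147483647 134512640 134658403 "
    "3221224888 3221223860 1074229440 344322 0 2147483654 81921 0 0 0"
)


def parse_stat(line):
    """Return selected fields from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split around the last ')' rather than on whitespace alone.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3, so field n is rest[n - 3]
    return {
        "pid": int(pid_text),
        "comm": comm,
        "state": rest[0],
        "ppid": int(rest[1]),
        "utime_ticks": int(rest[11]),      # field 14
        "stime_ticks": int(rest[12]),      # field 15
        "pending_signals": int(rest[28]),  # field 31, decimal bitmask
        "wchan": int(rest[32]),            # field 35
    }


fields = parse_stat(STAT_LINE)
cpu_seconds = (fields["utime_ticks"] + fields["stime_ticks"]) / HZ
print(fields["state"], cpu_seconds, hex(fields["pending_signals"]))
```

Under those assumptions the process has used 31 ticks in total, about a
third of a second of CPU, and the decimal pending-signal field (344322) is
the same value as the SigPnd mask 00054102 shown in the status output.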

It is running all the time but not on a processor! This particular
process happens to be a pdksh, but the problem seems to affect any
process. I've seen grep, ping, etc. all in the same state at one time or
another. They hang around until a reboot.

You can see I've sent kill signals, and even though they are posted and
not blocked, they are not received. I've just started poking through
kernel code and am toying with the idea of poking status into the
process table entry. I'm wondering if it is possible to force a schedule.
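
The signal masks in the status output can be decoded with a few lines of
Python. This is only a sketch: it assumes that bit n-1 of each mask stands
for signal n, and the name table uses the Linux/i386 signal numbers, which
differ on other architectures. decode_sigmask and SIGNAL_NAMES are
illustrative names, not anything from the kernel.

```python
# Sketch: decode the hex signal masks from /proc/<pid>/status.
# Assumption: bit (n - 1) of the mask corresponds to signal number n.

# Partial table of Linux/i386 signal numbers (assumed; other
# architectures number some of these differently).
SIGNAL_NAMES = {
    1: "SIGHUP", 2: "SIGINT", 3: "SIGQUIT", 9: "SIGKILL",
    15: "SIGTERM", 17: "SIGCHLD", 19: "SIGSTOP",
}


def decode_sigmask(hex_mask):
    """Return the sorted signal numbers whose bits are set in hex_mask."""
    mask = int(hex_mask, 16)
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def describe(hex_mask):
    """Return signal names, falling back to the bare number if unknown."""
    return [SIGNAL_NAMES.get(n, str(n)) for n in decode_sigmask(hex_mask)]


pending = decode_sigmask("00054102")  # SigPnd
blocked = decode_sigmask("00000000")  # SigBlk
print("pending:", describe("00054102"))
print("blocked:", describe("00000000"))
```

With that mapping, the pending set is signals 2, 9, 15, 17 and 19, that is
SIGINT, SIGKILL, SIGTERM, SIGCHLD and SIGSTOP on i386, and the blocked set
is empty. SIGKILL and SIGSTOP cannot be caught or ignored, so a pending
SIGKILL that never gets delivered fits the picture of a task that is never
scheduled.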

This machine runs a LOT of processes.

9:44pm up 8 days, 2:43, 2 users, load average: 1.06, 1.14, 1.11

cpu 1125173 934598 1890497 66147489
disk 1190748 1 0 0
disk_rio 176651 1 0 0
disk_wio 1014097 0 0 0
disk_rblk 353368 4 0 0
disk_wblk 2028212 0 0 0
page 2963286 3366647
swap 11 4
intr 78458500 70097757 8 0 0 0 5250593 2 0 0 0 0 1919368 0 1 1190771 0
ctxt 30032585
btime 905997664
processes 3911319
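
To put the "2-4 million processes" figure into days, the counters above
can be turned into a process creation rate. The sketch below assumes a
single CPU and HZ = 100, so that the four cpu counters sum to the number of
ticks since boot; those are my assumptions, and parse_proc_stat is just an
illustrative name.

```python
# Sketch: estimate uptime and process creation rate from /proc/stat.
# Assumptions: uniprocessor machine and HZ = 100 ticks per second,
# so the sum of the "cpu" counters equals ticks since boot.

HZ = 100  # assumed clock tick rate

STAT_TEXT = """cpu 1125173 934598 1890497 66147489
ctxt 30032585
btime 905997664
processes 3911319"""


def parse_proc_stat(text):
    """Map each /proc/stat keyword to its list of integer values."""
    values = {}
    for line in text.splitlines():
        key, *numbers = line.split()
        values[key] = [int(n) for n in numbers]
    return values


stats = parse_proc_stat(STAT_TEXT)
total_ticks = sum(stats["cpu"])  # user + nice + system + idle
uptime_days = total_ticks / HZ / 86400
forks_per_day = stats["processes"][0] / uptime_days

print(f"uptime: {uptime_days:.2f} days")
print(f"processes created per day: {forks_per_day:.0f}")
for millions in (2, 4):
    days = millions * 1_000_000 / forks_per_day
    print(f"{millions} million processes take about {days:.1f} days")
```

The tick total comes to 70097757, the same as the first per-IRQ count on
the intr line, and works out to a little over 8.1 days, in line with the
uptime shown above. At that rate the machine creates somewhat under half a
million processes per day.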

I'd appreciate any ideas on how to proceed. Obviously, I'd like to
contribute towards tracking down and fixing the problem, but I'll settle
for a way to make the processes disappear short of having to boot the
system.

As a side note, I have several systems around here that have uptimes in
the 250-300 day range. My longest uptime was 477 days. Linux is
gathering a serious following around here.. :-) Thanks for all the hard
work!

-- 
+--------------------------------------------+ Rick Bressler
|Mushrooms and other fungi have several      |
|important roles in nature.  They help things| 
|grow, they are a source of food, they       | bressler@mushroom.ca.boeing.com
|decompose organic matter and they           |
|infect, debilitate and kill organisms.      | Linux: Because a PC is a
+--------------------------------------------+ terrible thing to waste.
