Re: Please HELP! System hangs frequently... (better format)

Koles Sandor (koles@elender.hu)
Thu, 22 Apr 1999 21:14:15 +0200


[1.]
System hangs frequently after several (10-20) Oops

[2.]
Since we had changed the kernel from 2.0.30 to 2.2.x, the system hangs
frequently.
Nothing work except I can ping the box from other machines. The SysRq key
cannot do an Emergency Remount ro, but that writes to the console. The other
SysRq functions seems to work, but do not nothing (after TERM or KILL signal
the tasks remains if I press the Alt-SysRq-t keys).
After reboot (Alt-SysRq-b) the filesystem check begins... about 30-40
minutes
(34 GB).

[3.]
Keywords: Kernel, SMP

[4.]
Linux version 2.2.6-ac1 (root@leon.online.hu) (gcc version 2.7.2.3) #2 SMP
Thu Apr 22 14:39:06 MET 1999

[5.]
Output of ksymoops (only the first two Oops):

Options used: -V (default)
-o /lib/modules/2.2.6-ac1/ (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-m /usr/src/linux/System.map (default)
-c 1 (default)
-a - same as ksymoops (default)

You did not tell me where to find symbol information. I will assume
that the log matches the kernel and modules that are running right now
and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

No modules in ksyms, skipping objects
Apr 22 15:30:50 leon kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000008
Apr 22 15:30:50 leon kernel: current->tss.cr3 = 14360000, pr3 = 14360000
Apr 22 15:30:50 leon kernel: *pde = 00000000
Apr 22 15:30:50 leon kernel: Oops: 0000
Apr 22 15:30:50 leon kernel: CPU: 0
Apr 22 15:30:50 leon kernel: EIP: 0010:[<c0130dae>]
Apr 22 15:30:50 leon kernel: EFLAGS: 00010282
Apr 22 15:30:50 leon kernel: eax: 00000000 ebx: cdd2d200 ecx: cdd2d200
edx: cdd2d200
Apr 22 15:30:50 leon kernel: esi: cdd2d200 edi: 00000000 ebp: 00000001
esp: d3fd5f44
Apr 22 15:30:50 leon kernel: ds: 0018 es: 0018 ss: 0018
Apr 22 15:30:50 leon kernel: Process smbd (pid: 2078, process nr: 193,
stackpage=d3fd5000)
Apr 22 15:30:50 leon kernel: Stack: 00000000 00000001 d3bafcb0 00000004
ffffffeb d3bafcb0 c012d77a d3bafcb0
Apr 22 15:30:50 leon kernel: 00000004 cdd2d200 ffffffe9 00000000
ef92e000 00000000 d3fd4000 c012f916
Apr 22 15:30:50 leon kernel: bfffd758 c012f95b c0127009 cdd2d200
d3fd4000 c0125ff2 cdd2d200 d3fd4000
Apr 22 15:30:50 leon kernel: Call Trace: [<c012d77a>] [<c012f916>]
[<c012f95b>] [<c0127009>] [<c0125ff2>] [<c0107aec>]
Apr 22 15:30:50 leon kernel: Code: 8b 40 08 89 44 24 10 8b 6c 24 10 83 c5 70
8b 54 24 10 8b 72
Copying default arch from ksymoops, bfd_arch=8 bfd_mach=0

>>EIP; c0130dae <locks_remove_flock+e/88>
Trace; c012d77a <open_namei+1ca/2f4>
Trace; c012f916 <sys_getdents+f6/160>
Trace; c012f95b <sys_getdents+13b/160>
Trace; c0127009 <fput+11/48>
Trace; c0125ff2 <sys_lseek+ba/e0>
Trace; c0107aec <system_call+34/38>
Code; c0130dae <locks_remove_flock+e/88> 00000000 <_EIP>:
Code; c0130dae <locks_remove_flock+e/88> 0: 8b 40 08 movl
0x8(%eax),%eax
Code; c0130db1 <locks_remove_flock+11/88> 3: 89 44 24 10 movl
%eax,0x10(%esp,1)
Code; c0130db5 <locks_remove_flock+15/88> 7: 8b 6c 24 10 movl
0x10(%esp,1),%ebp
Code; c0130db9 <locks_remove_flock+19/88> b: 83 c5 70 addl
$0x70,%ebp
Code; c0130dbc <locks_remove_flock+1c/88> e: 8b 54 24 10 movl
0x10(%esp,1),%edx
Code; c0130dc0 <locks_remove_flock+20/88> 12: 8b 72 00 movl
0x0(%edx),%esi

Apr 22 15:30:50 leon kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000008
Apr 22 15:30:50 leon kernel: current->tss.cr3 = 00101000, pr3 = 00101000
Apr 22 15:30:50 leon kernel: *pde = 00000000
Apr 22 15:30:50 leon kernel: Oops: 0000
Apr 22 15:30:50 leon kernel: CPU: 0
Apr 22 15:30:50 leon kernel: EIP: 0010:[<c0125db0>]
Apr 22 15:30:50 leon kernel: EFLAGS: 00010246
Apr 22 15:30:50 leon kernel: eax: 00000000 ebx: cdd2d200 ecx: 00000008
edx: cdd2d200
Apr 22 15:30:50 leon kernel: esi: 00000000 edi: 00000000 ebp: d7367dc0
esp: d3fd5ea0
Apr 22 15:30:50 leon kernel: ds: 0018 es: 0018 ss: 0018
Apr 22 15:30:50 leon kernel: Process smbd (pid: 2078, process nr: 193,
stackpage=d3fd5000)
Apr 22 15:30:50 leon kernel: Stack: d7367dc0 00000001 c0117bd5 cdd2d200
d7367dc0 d3fd5f08 d3fd4000 00000008
Apr 22 15:30:50 leon kernel: d50cc020 00000008 d3fd4000 c0107ffb
0000000b 00000000 c010ecf2 c01e389a
Apr 22 15:30:50 leon kernel: d3fd5f08 00000000 d3fd4000 cdd2d200
00000000 00000001 d50c9ce0 c0107bf1
Apr 22 15:30:50 leon kernel: Call Trace: [<c0117bd5>] [<c0107ffb>]
[<c010ecf2>] [<c01e389a>] [<c0107bf1>] [<c0130dae>] [<c012d77a>]
Apr 22 15:30:50 leon kernel: [<c012f916>] [<c012f95b>] [<c0127009>]
[<c0125ff2>] [<c0107aec>]
Apr 22 15:30:50 leon kernel: Code: 83 7f 08 00 74 0a 55 53 e8 57 af 00 00 83
c4 08 53 e8 32 12

>>EIP; c0125db0 <filp_close+44/64>
Trace; c0117bd5 <do_exit+17d/2d0>
Trace; c0107ffb <die+53/54>
Trace; c010ecf2 <do_page_fault+2d2/324>
Trace; c01e389a <lk_lockmsg+1183/1209>
Trace; c0107bf1 <error_code+2d/34>
Trace; c0130dae <locks_remove_flock+e/88>
Trace; c012d77a <open_namei+1ca/2f4>
Trace; c012f916 <sys_getdents+f6/160>
Trace; c012f95b <sys_getdents+13b/160>
Trace; c0127009 <fput+11/48>
Trace; c0125ff2 <sys_lseek+ba/e0>
Trace; c0107aec <system_call+34/38>
Code; c0125db0 <filp_close+44/64> 00000000 <_EIP>:
Code; c0125db0 <filp_close+44/64> 0: 83 7f 08 00 cmpl
$0x0,0x8(%edi)
Code; c0125db4 <filp_close+48/64> 4: 74 0a je
10 <_EIP+0x10> c0125dc0 <filp_close+54/64>
Code; c0125db6 <filp_close+4a/64> 6: 55 pushl
%ebp
Code; c0125db7 <filp_close+4b/64> 7: 53 pushl
%ebx
Code; c0125db8 <filp_close+4c/64> 8: e8 57 af 00 00 call
af64 <_EIP+0xaf64> c0130d14 <locks_remove_posix+0/8c>
Code; c0125dbd <filp_close+51/64> d: 83 c4 08 addl
$0x8,%esp
Code; c0125dc0 <filp_close+54/64> 10: 53 pushl
%ebx
Code; c0125dc1 <filp_close+55/64> 11: e8 32 12 00 00 call
1248 <_EIP+0x1248> c0126ff8 <fput+0/48>

[6.]
We don't know what program triggers the problem :(

[7.]
Environment:

Dell 6100 4xPPRO 200Mhz, 512Kb Cache
AMI Megaraid controller, 6x9 GB Hard Disk, HW Raid5, one big partition
768 MB RAM
Now: SMP Linux 2.2.6 with ac1 patch. (Previously: 2.2.3, 2.2.4, 2.2.5
without ac-patches)
Software: iBCS, Progress 8.2
The box is application and file server (with samba).
20-40 concurrent user.

We use the AMI Megaraid controller only since the new kernel. Before that we
used the on-board Adaptec controller.

[7.1.]
Software:

Linux leon.online.hu 2.2.6-ac1 #2 SMP Thu Apr 22 14:39:06 MET 1999 i686
unknown
Kernel modules 2.1.121
Gnu C 2.7.2.3
Binutils 2.9.1.0.15
Linux C Library 2.0.7
Dynamic linker ldd (GNU libc) 2.0.7
Linux C++ Library 2.8.0
Procps 2.0.0
Mount 2.8a
Net-tools 1.50
Kbd 0.96
Sh-utils 1.16
Modules Loaded iBCS

[7.2.]
Processor information (first, the other three same):

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 1
model name : Pentium Pro
stepping : 9
cpu MHz : 198.963828
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
bogomips : 198.25

[7.3.]
Module information:

iBCS 119908 101

[7.4.]
SCSI information:

Attached devices:
Host: scsi1 Channel: 00 Id: 05 Lun: 00
Vendor: NEC Model: CD-ROM DRIVE:462 Rev: 1.14
Type: CD-ROM ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 06 Lun: 00
Vendor: ARCHIVE Model: Python 00095-001 Rev: 5.ac
Type: Sequential-Access ANSI SCSI revision: 02
Host: scsi2 Channel: 00 Id: 06 Lun: 00
Vendor: DELL Model: 6UW BACKPLANE Rev: 7
Type: Processor ANSI SCSI revision: 02
Host: scsi2 Channel: 02 Id: 00 Lun: 00
Vendor: MegaRAID Model: LD0 RAID5 34272R Rev: U.75
Type: Direct-Access ANSI SCSI revision: 02

[7.5.]
With 2.0.30 kernel was an other problem: the IPI interrupt rarely
growed up (once per month) and the system gone down. Now that is
disappeared.

[X.] Usually there are 200-300 processes. ~20 smbd, ~80 _mprosrv,
~50 _progres. Because PROGRESS, we changed the semaphor limits in
sem.h: SEMMSL 32 --> 256, SEMOPM 32 --> 256. The NR_TASKS also
changed in tasks.h: 512 --> 1024.

In the rc.local:
echo 8192 > /proc/sys/fs/file-max
echo 32768 > /proc/sys/fs/inode-max

We tried with and without the ac patches.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/