Apparently wrong "Segmentation fault" errors

Jerome J. Erpenbeck (jje@tdo-serv.lanl.gov)
Thu, 18 Apr 1996 06:56:19 -0600


I have just subscribed to this newsgroup and fear that this submission may not
be entirely appropriate. On the other hand, I am becoming somewhat desperate!

In an attempt to evaluate the suitability of using Pentium and Pentium Pro
based machines as compute-servers in a UNIX network environment, I have installed
Linux 1.2.13 (Slackware) on a (networked) no-name 586/166MHz system having 32Mbytes
of memory. I have also installed the same system on a standalone 486/DX2 machine
with 8M of RAM. We are interested in running very long-running programs on such
machines, with emphasis on FORTRAN 77. Because some of our programs
make extensive use of dynamic memory allocation and pointers, as implemented in the
SUN, HP, and CRAY F77 extensions, we have installed the ABSOFT F77 compiler in
addition to the GNU g77. The problems which I will describe appear to be independent of
which compiler is involved and appear, then, to have nothing to do with dynamic
memory allocation.

In the absence of compute-bound programs running, the machine seems to operate
well. However, with one or more compute-bound programs running (say, using 3.5Mbytes
of RAM plus 17.5Mbytes of shared memory according to top), an attempt to start another
program will frequently fail with a "segmentation fault" printed on the terminal.
For example,

iriz:~/mcmd/rgd_mol/s> make FFLAGS=-O all
make DATE="`date`" FFLAGS=-O NS=1 /home/jje/bin/1.2.13/xmcmd_mxp_1
make[1]: Entering directory `/home/jje/mcmd/rgd_mol/s'
make clean
make[2]: Entering directory `/home/jje/mcmd/rgd_mol/s'
make[1]: *** [/home/jje/bin/1.2.13/xmcmd_mxp_1] Segmentation fault
make[1]: Leaving directory `/home/jje/mcmd/rgd_mol/s'
make: *** [all_1.2.13] Error 2
iriz:~/mcmd/rgd_mol/s>

In this particular case, no error is recorded in the /var/adm/syslog or messages file.
At other times, however, the messages file contains error messages such as,

Apr 10 19:00:01 iriz kernel: Oops: 0002
Apr 10 19:00:01 iriz kernel: EIP: 0010:0012ee20
Apr 10 19:00:01 iriz kernel: EFLAGS: 00010283
Apr 10 19:00:01 iriz kernel: eax: 001c207c ebx: 001c255c ecx: 001c35e0 edx: 353f2765
Apr 10 19:00:01 iriz kernel: esi: 0043f070 edi: 00213ee0 ebp: 00000000 esp: 01de3ecc
Apr 10 19:00:01 iriz kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Apr 10 19:00:01 iriz kernel: Process sh (pid: 12512, process nr: 3, stackpage=01de3000)
Apr 10 19:00:01 iriz kernel: Stack: 00000003 08054a00 00213ee0 00000010 08054a00 001c35f0 0014d5c5 00213ee0
Apr 10 19:00:01 iriz kernel: 0043f070 00000003 0001cf1b 010d09f0 00000001 00213ee0 08054a00 00000000
Apr 10 19:00:01 iriz kernel: 00213460 00000000 00000000 00000068 00000000 00000400 00000000 00000000
Apr 10 19:00:01 iriz kernel: Call Trace: 0014d5c5 0011cc97 0012254a 001102e9
Apr 10 19:00:01 iriz kernel: Code: 89 5a 2c 89 58 30 5b 5e 5f 5d 83 c4 08 c3 90 90 8b 1d d0 35

where the process could just as well have been make or tar. The machine seems to get
increasingly "sick" and sometimes crashes, with messages like the following
in the /var/adm/syslog file:

Apr 11 08:19:38 iriz kernel: Unable to handle kernel paging request at virtual address f53f2791
Apr 11 08:19:38 iriz kernel: current->tss.cr3 = 01b37000,
Apr 11 08:19:38 iriz kernel: *pde = 00000000

repeated many times. I have no idea whether I am dealing with hardware or kernel
problems. The only indication that it might be a problem with the kernel is that
I sometimes observe similar behavior on the 486/DX2 machine as well.

For some time I thought the system was unable to swap to disk, because top and
vmstat would report no usage of swap space (I have blocks of 12M, 12M and 10M
as swap in /etc/fstab, which seem to be recognized properly when the machine
boots). However, in at least one case I was able to start a 29M process under
similar circumstances, and that did cause lots of swapping. Indeed, just now I was
able to start that same 29M process, even though I was unable to run "tar czf /dev/fd0
./s", which caused the following /var/adm/messages entry:

Apr 17 17:15:11 iriz kernel: invalid operand: 0000
Apr 17 17:15:11 iriz kernel: EIP: 0010:0014fb60
Apr 17 17:15:11 iriz kernel: EFLAGS: 00010202
Apr 17 17:15:11 iriz kernel: eax: 00000004 ebx: 00000000 ecx: 01c69ec4 edx: 007f7e90
Apr 17 17:15:11 iriz kernel: esi: 0000000c edi: 015f08c0 ebp: 007f7e98 esp: 007f7e38
Apr 17 17:15:11 iriz kernel: ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Apr 17 17:15:11 iriz kernel: Process tar (pid: 2388, process nr: 39, stackpage=007f7000)
Apr 17 17:15:11 iriz kernel: Stack: 0014d8f4 015f08c0 0000000c 00000000 007f7e90 0083e240 00002800 015f08c0
Apr 17 17:15:11 iriz kernel: 0801f000 00487004 00000000 00000001 00000400 00000304 007f7f14 001ce724
Apr 17 17:15:11 iriz kernel: 007f7e98 00000003 00000009 0000000d 00002600 00000200 ffffffe4 00c5dbf4
Apr 17 17:15:11 iriz kernel: Call Trace: 0014d8f4 00150c6d 00150c88 00150cb6 0012bbce 0012bc44 0012b9c0
Apr 17 17:15:11 iriz kernel: 0012afe8 0012294c 0011cc97 0012289c 001102e9
Apr 17 17:15:11 iriz kernel: Code: 83 ec 04 55 57 56 53 8b 74 24 1c 8b 54 24 18 8b 42 3c 8b 58
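
For anyone wanting to compare reports like the two above, here is a minimal sketch
that pulls the EIP, the process name and pid, and the call-trace addresses out of
such log lines. It assumes only the line layout visible in these excerpts; the
function name parse_oops and the report keys are hypothetical, chosen for
illustration, and the sketch does not try to map addresses to kernel symbols.

```python
import re

# Eight hex digits, as the addresses appear in the log excerpts.
HEX8 = re.compile(r"[0-9a-fA-F]{8}")


def parse_oops(lines):
    """Collect EIP, process name/pid and call-trace addresses.

    Hypothetical helper: assumes the syslog layout shown in the excerpts,
    i.e. '<date> <host> kernel: <message>'.
    """
    report = {"eip": None, "process": None, "pid": None, "trace": []}
    in_trace = False
    for line in lines:
        # Drop the syslog prefix up to and including "kernel:".
        _, found, msg = line.partition("kernel:")
        if not found:
            continue  # not a kernel message
        msg = msg.strip()

        eip = re.match(r"EIP:\s+[0-9a-fA-F]{4}:\s*([0-9a-fA-F]{8})", msg)
        proc = re.match(r"Process (\S+) \(pid: (\d+)", msg)
        if eip:
            report["eip"] = int(eip.group(1), 16)
        elif proc:
            report["process"] = proc.group(1)
            report["pid"] = int(proc.group(2))

        # The call trace may continue on following lines of bare addresses.
        if msg.startswith("Call Trace:"):
            in_trace = True
            tokens = msg[len("Call Trace:"):].split()
        elif in_trace:
            tokens = msg.split()
        else:
            continue

        if tokens and all(HEX8.fullmatch(t) for t in tokens):
            report["trace"].extend(int(t, 16) for t in tokens)
        else:
            in_trace = False  # e.g. the "Code:" line ends the trace
    return report


# Lines copied from the Apr 17 report above.
sample = [
    "Apr 17 17:15:11 iriz kernel: EIP: 0010:0014fb60",
    "Apr 17 17:15:11 iriz kernel: Process tar (pid: 2388, process nr: 39, stackpage=007f7000)",
    "Apr 17 17:15:11 iriz kernel: Call Trace: 0014d8f4 00150c6d 00150c88 00150cb6 0012bbce 0012bc44 0012b9c0",
    "Apr 17 17:15:11 iriz kernel: 0012afe8 0012294c 0011cc97 0012289c 001102e9",
    "Apr 17 17:15:11 iriz kernel: Code: 83 ec 04 55 57 56 53 8b 74 24 1c 8b 54 24 18 8b 42 3c 8b 58",
]

if __name__ == "__main__":
    report = parse_oops(sample)
    print(hex(report["eip"]), report["process"], report["pid"])
    print([hex(a) for a in report["trace"]])
```

Run over several reports, something like this would at least show whether the
faulting addresses and call traces repeat or are scattered.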

I would be delighted to hear from anyone who can suggest the source of the problem,
or even anyone who has seen similar problems.

--Jerry Erpenbeck
jje@lanl.gov