2.0.26 problems

Zoltan Hidvegi (hzoli@cs.elte.hu)
Tue, 26 Nov 1996 19:48:49 +0100 (MET)


The biggest problem is random "kernel stack corruption. Aiee" messages.
This was present in all earlier kernel versions as well. It is not
frequent; these are the messages from the last 36 hours:

Nov 25 11:35:05 neumann kernel: release: Xreset kernel stack corruption. Aiee
Nov 25 13:22:37 neumann kernel: release: Xreset kernel stack corruption. Aiee
Nov 25 13:47:52 neumann kernel: release: Xreset kernel stack corruption. Aiee
Nov 25 15:41:25 neumann kernel: release: touch kernel stack corruption. Aiee
Nov 25 15:50:00 neumann kernel: release: Xreset kernel stack corruption. Aiee
Nov 25 16:06:50 neumann kernel: release: Xreset kernel stack corruption. Aiee
Nov 26 11:58:43 neumann kernel: release: Xreset kernel stack corruption. Aiee
Nov 26 14:06:51 neumann kernel: release: Xreset kernel stack corruption. Aiee
Nov 26 15:29:58 neumann kernel: release: Xreset kernel stack corruption. Aiee
Nov 26 17:16:56 neumann kernel: release: touch kernel stack corruption. Aiee

Xreset is a simple script:

#!/usr/local/bin/zsh -f
#
# Xreset
#
# This program is run as root after the session terminates but
# before the display is closed
#

[[ "$DISPLAY" = (|localhost|unix|${HOST%%.*}):[0-9](|.[0-9]) ]] &&
DISPLAY="${HOST}:${DISPLAY##*:}"

[[ -f $HOME/.display && "$(<$HOME/.display)" = "$DISPLAY" ]] &&
rm -f $HOME/.display

The machine runs xdm logins for about 7 NCD X terminals, one Linux X
terminal and the console, so Xreset is executed often. I have seen Aiee
messages from other processes too. The full list: TakeConsole (/bin/sh ->
bash script), Xstartup (zsh script), elm, latex, m4 (executed by fvwm to
parse the fvwmrc), more, quota, sh and touch. Each of these produced the
message only once, except Xreset (61), more (7) and touch (3). The machine
handles a lot of NFS traffic and the xdm scripts run from an NFS-mounted
directory. All the other programs mentioned are NFS-related as well: more
shows NFS-mounted files, latex loads its fmt from NFS, m4 gets its input
from NFS, elm reads /var/spool/mail over NFS, and many users run touch
scripts on an NFS-mounted scratch directory so that the automatic file
deletion does not remove their files.
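
Per-process counts like the ones above can be tallied from the log with a
short shell function. This is only a minimal sketch, assuming every message
has the "release: NAME kernel stack corruption" format shown in the excerpt;
aiee_tally is a name made up for illustration.

```shell
# Tally "kernel stack corruption" messages per process name.
# Usage: aiee_tally [logfile ...]   (reads stdin when no file is given)
# Assumes the line format "... release: NAME kernel stack corruption. Aiee".
aiee_tally() {
    # Extract the process name between "release: " and
    # " kernel stack corruption"; lines that do not match are dropped.
    sed -n 's/.*release: \(.*\) kernel stack corruption.*/\1/p' "$@" |
        sort | uniq -c | sort -rn |
        # Normalize the padded output of uniq -c to "COUNT NAME".
        awk '{ print $1, $2 }'
}
```

Called as "aiee_tally /var/log/kernlog" (the path is an assumption, use
whatever file syslog writes kernel messages to), it prints one "count name"
line per process, most frequent first.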

We recently replaced the motherboard of the machine and it did not change
anything. The new motherboard is a dual PPro 200, Intel 440FX, with onboard
Adaptec 7880 Ultra Wide SCSI, running an SMP kernel. The old one was an
Intel 430VX based board with a P133 and an Adaptec 2940 Ultra SCSI-2 card
(non-wide), running a non-SMP kernel.

The machine also has a MiroVideo 40SV (S3 968 based, 4 MB VRAM) VGA card
and an SMC Ultra ISA network card.

Apart from these Aiee messages there is no related info in the kernlog,
and the machine is otherwise very stable (the MB was recently replaced, and
with the old P133 MB the average load was over 2 during the afternoons and
it was often more than 5). With the SMP board I get lots of "Ignoring P6
Local APIC Spurious Interrupt Bug..." messages, but as far as I know that
is normal and caused by a known PPro bug.

I have heard about this Aiee problem from other people as well, and the
common point was the Adaptec 7xxx card (maybe NFS was also a common point,
I do not know). We also have a heavily loaded 5x86/133 server with an SMC
Ultra and an ncr53c810, and it does not have this Aiee problem.

% cat /proc/scsi/aic7xxx/0
Adaptec AIC7xxx driver version: 4.0/3.2/4.0

Compile Options:
AIC7XXX_RESET_DELAY : 5
AIC7XXX_CMDS_PER_LUN : 8
AIC7XXX_TWIN_SUPPORT : Enabled
AIC7XXX_TAGGED_QUEUEING: Enabled
AIC7XXX_PAGE_ENABLE : Enabled
AIC7XXX_PROC_STATS : Disabled

Adapter Configuration:
SCSI Adapter: AIC-7880 Ultra
(AIC-788x chipset)
Host Bus: Wide
Base IO: 0xec00
IRQ: 11
SCBs: Used 10, HW 16, Page 255
Interrupts: 249986
Serial EEPROM: True
Extended Translation: Enabled
SCSI Bus Reset: Enabled
Ultra SCSI: Enabled
Target Disconnect: Enabled
% cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: Quantum Model: XP34300W Rev: L912
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: SANYO Model: CRD-400I Rev: 1.32
Type: CD-ROM ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
Vendor: HP Model: C1533A Rev: 9503
Type: Sequential-Access ANSI SCSI revision: 02

I also tried the Adaptec driver with the default parameters, with the same
results.

The second problem is SMP related. When changing the keyboard LEDs under X
with CapsLock, the machine sometimes freezes for a few seconds. During the
freeze it does not respond to ping and the mouse pointer does not move,
then it comes back to life by itself.

The third problem is related to bridging and ppp. Take a normal ppp
dial-in server on an ethernet network with IP forwarding enabled. Compile
a kernel with experimental bridge support. Start a ppp connection and
issue brcfg -ena. After that the remote ppp client stops seeing the ppp
server, but it can still talk to the rest of the world. Everything goes
back to normal immediately after brcfg -dis. I tested this with ppp-2.3b3,
but I guess it is reproducible with older ppp versions too.

Zoltan