EB164 extreme weirdness

Ian Pratt (Ian.Pratt@cl.cam.ac.uk)
Wed, 10 Apr 1996 16:55:40 +0100


We have three EB164s which all exhibit some very strange
behaviour. We often find that simple commands like 'uname', 'tar',
'mv', 'id' and 'cp' fail sporadically, even when invoked with
trivial or no arguments, e.g. 'uname', 'id', 'tar --v', 'mv --v'.

When a particular command has decided to get itself into this
state, invoking it repeatedly will cause it to fail with a memory
violation at the same PC. Leave the shell alone for 10 seconds
and try it again, and the command will magically work. Repeating
it immediately will cause it to fail at the same PC again. Once
into this state, the behaviour is totally repeatable!
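
A minimal sketch of how the repeat / idle / repeat pattern described
above could be scripted, so that the outcome of each run is recorded
rather than eyeballed. It assumes Python 3 on a POSIX system; the
names run_once and probe are hypothetical helpers made up for this
illustration, and the 10 second default simply mirrors the idle time
mentioned above.

```python
import subprocess
import sys
import time


def run_once(argv):
    """Run argv once and describe how it ended."""
    proc = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX a negative return code means the child was killed
        # by that signal number (11 is SIGSEGV on Linux).
        return "signal %d" % -proc.returncode
    return "exit %d" % proc.returncode


def probe(argv, burst=5, idle_seconds=10.0, out=sys.stdout):
    """Run a burst of invocations, idle, then run a second burst."""
    before = [run_once(argv) for _ in range(burst)]
    time.sleep(idle_seconds)
    after = [run_once(argv) for _ in range(burst)]
    print("before idle:", before, file=out)
    print("after idle: ", after, file=out)
    return before, after
```

Calling probe(["uname"]) from the affected shell would show whether
the first run after the idle period succeeds while back-to-back runs
keep dying on the same signal.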

Switching shells, e.g. from sh to bash or vice versa, can cause the
problem to go away. While the problem is being exhibited on one
virtual console, it can be fine on another. We do get a core when
they fail, but the version of gdb we have fails to understand
core files. Has anyone fixed this? Running things under gdb
invariably makes them work fine.

We've put register dump code into arch/alpha/mm/fault.c, and
found that the PC is sometimes near zero, and the RP looks
plausible, but doesn't point anywhere near a branch/jump instruction.
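
The "near a branch/jump instruction" test can be made mechanical. The
sketch below relies only on the Alpha instruction encoding: the major
opcode sits in bits 31..26, opcode 0x1A covers JMP/JSR/RET, and
opcodes 0x30 to 0x3F are the branch formats (BR, BSR and the
conditional branches). Given the raw bytes of a text segment, it
checks whether the instruction immediately before a return address is
one of these. The function names and the text/text_base parameters
are hypothetical, chosen for illustration; this is not the dump code
added to fault.c.

```python
import struct

JUMP_OPCODE = 0x1A          # JMP, JSR, RET, JSR_COROUTINE
FIRST_BRANCH_OPCODE = 0x30  # 0x30..0x3F are all branch-format opcodes


def opcode(word):
    """Return the 6-bit major opcode held in bits 31..26."""
    return (word >> 26) & 0x3F


def is_branch_or_jump(word):
    """True if the 32-bit instruction word is a branch or a jump."""
    op = opcode(word)
    return op == JUMP_OPCODE or op >= FIRST_BRANCH_OPCODE


def return_address_plausible(text, text_base, ra):
    """True if the instruction just before ra is a branch or jump.

    text is the raw bytes of the code segment loaded at text_base.
    """
    offset = ra - 4 - text_base
    if ra % 4 != 0 or offset < 0 or offset + 4 > len(text):
        return False
    # Alpha instructions are 32-bit little-endian words.
    (word,) = struct.unpack_from("<I", text, offset)
    return is_branch_or_jump(word)
```

For example, a return address that follows the word 0x6B5B4000
(jsr $26,($27)) passes the check, while one that follows 0x47FF041F
(a nop) does not.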

Sometimes the PC is OK, and points at a LD/ST instruction, but the
register being indirected off appears to contain garbage. We've
rebuilt uname/tar/grep with symbols, and found that the PC value
where the memory violation occurred is often within libc (we've
seen it go in getopts and strlen amongst others).

We've rebuilt libc from libc-0.40.3 from azstarnet, and relinked
uname/tar/grep etc. This hasn't helped.

All our EB164s have Redhat-2.1 axp installed, and the problem
exists with all kernel versions we've tried up to and including
1.3.85. All the kernels have been patched to disable DISCONNECTS
in the 53c810 driver.

The exact same binaries run fine on a Redhat-2.1 EB66+.

Can anyone shed any light on this, please?

Thanks,
Ian