Re: EB164 extreme weirdness

Ian Pratt (Ian.Pratt@cl.cam.ac.uk)
Wed, 10 Apr 1996 19:29:28 +0100


> I've actually seen this myself on my eb164, with "uname" and "touch".
>
> I thought it was a library issue, because it has gone away for me since I
> recompiled those binaries (actually, I recompiled all the shellutils etc
> at the same time), and I ignored the problem.
>
> However, the symptoms certainly _sound_ like there is something wrong in
> the context invalidate code for the eb164, and maybe the reason I haven't
> seen it after the recompile is just luck rather than a library
> issue.

Perhaps I'm not using the latest and greatest libc? I'm using
libc-0.40.3 from azstarnet, plus obvious patches to io.c in order
for it to understand the Alcor chipset (gross...)

> When it happened with "touch" for me, I could make the problem go away by
> simply doing another command in between. However, I for other reasons
> suspected that it was an argument/environment problem, so I just took
> that as a confirmation of my suspicion (bash will modify the environment
> variable "_" for different commands, and for some strange reason I
> thought that would make a difference. I probably need to have my brain
> checked out some day).

> That was one reason I suspected it was an environment thing: running the
> thing under gdb will result in a different argument/environment setup.
> Have you recompiled your shell too?

We were suspicious of the environment for a while, but then we
managed to make the problem manifest with both sh and
bash. (Perhaps sh is just a cut down bash? Hmm...)

>
> I guess I need to reconsider the implications of my earlier problems.
> Looks like maybe there is something that results in incorrect TLB entries
> under some circumstances. A missing invalidate somewhere that is brought
> up by the ev5 ASN-marked tlb entries that can be cached across context
> switches..
>

We think this could well be plausible because of the PC/RP values
we've observed when things memory fault. We've had two instances
of the PC being the first instruction of a libc function (likely
to TLB miss). On another occasion, the RP was a plausible
address, but didn't point to anywhere near a branch/jump instr
(executing the wrong code perhaps?)
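
To make the suspected failure mode concrete, here is a toy model in C
of an ASN-tagged TLB. It is only a sketch of the hypothesis, not the
kernel's code or the real ev5 TLB organisation: the direct-mapped
layout, the sizes and all the names (tlb_insert, tlb_lookup,
tlb_invalidate_asn) are assumptions made for illustration. The point
is that an entry tagged with an ASN survives a context switch, so if
that ASN is handed to a new address space without an invalidate, a
lookup can hit on a stale translation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 8   /* toy size, not the real hardware's */

/* One cached translation, tagged with the address space number (ASN). */
struct tlb_entry {
    bool     valid;
    unsigned asn;
    uint64_t vpn;   /* virtual page number */
    uint64_t pfn;   /* physical frame number */
};

struct tlb {
    struct tlb_entry e[TLB_ENTRIES];
};

/* Cache a translation for (asn, vpn); direct-mapped for simplicity. */
void tlb_insert(struct tlb *t, unsigned asn, uint64_t vpn, uint64_t pfn)
{
    struct tlb_entry *slot = &t->e[vpn % TLB_ENTRIES];
    slot->valid = true;
    slot->asn = asn;
    slot->vpn = vpn;
    slot->pfn = pfn;
}

/* A hit needs a valid entry whose ASN and VPN both match. */
bool tlb_lookup(const struct tlb *t, unsigned asn, uint64_t vpn, uint64_t *pfn)
{
    const struct tlb_entry *slot = &t->e[vpn % TLB_ENTRIES];
    if (slot->valid && slot->asn == asn && slot->vpn == vpn) {
        *pfn = slot->pfn;
        return true;
    }
    return false;
}

/* What must happen before an ASN is reused for a different address space. */
void tlb_invalidate_asn(struct tlb *t, unsigned asn)
{
    for (size_t i = 0; i < TLB_ENTRIES; i++)
        if (t->e[i].asn == asn)
            t->e[i].valid = false;
}
```

If one process is given an ASN that another process used earlier and
tlb_invalidate_asn() is skipped, tlb_lookup() still hits on the old
entry and hands back the previous owner's frame, which would look
much like executing the wrong code.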

We've tried getting the fault handler to dump memory around the
faulting PC. The text was OK, but then we were looking at memory
through the D-cache (as opposed to I-cache) in kernel mode, and
thus this is by no means conclusive.

We've tried exercising the buffer cache heavily in the
background, but any effect was not obvious.

To exercise the paging code we've written a test prog that writes
test patterns into a 100MB swathe of VM and loops checking them
(we have 64MB physical ram). We have also written a program that
checks register save/restore. Both tests pass fine, though the
memory test seems to cause bash to take a memory violation about
every 30s. Bash handles the segv and continues - I suspect it
just fails to check a malloc return code...
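
For anyone wanting to try something similar, a minimal sketch of a
paging test of that kind is below. It is not the actual test program:
the pattern function, the names (pattern, fill, check,
vm_pattern_test) and the use of malloc are assumptions chosen for
illustration. Each pass writes a pattern derived from the word index
and the pass number, then reads everything back and counts
mismatches, so data left over from an earlier pass is detected too.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Expected value of word i on a given pass. Mixing in the pass number
   means a page restored with old contents will not match. */
uint64_t pattern(size_t i, unsigned pass)
{
    return ((uint64_t)i * 0x9E3779B97F4A7C15ULL) ^ pass;
}

/* Write the pattern for this pass into every word of buf. */
void fill(uint64_t *buf, size_t nwords, unsigned pass)
{
    for (size_t i = 0; i < nwords; i++)
        buf[i] = pattern(i, pass);
}

/* Count the words that do not hold the expected pattern. */
size_t check(const uint64_t *buf, size_t nwords, unsigned pass)
{
    size_t bad = 0;
    for (size_t i = 0; i < nwords; i++)
        if (buf[i] != pattern(i, pass))
            bad++;
    return bad;
}

/* Run `passes` fill/check rounds over `bytes` of memory. Returns the
   total number of mismatched words, or (size_t)-1 if malloc fails. */
size_t vm_pattern_test(size_t bytes, unsigned passes)
{
    size_t nwords = bytes / sizeof(uint64_t);
    assert(nwords > 0);

    uint64_t *buf = malloc(nwords * sizeof *buf);
    if (buf == NULL)
        return (size_t)-1;

    size_t bad = 0;
    for (unsigned p = 0; p < passes; p++) {
        fill(buf, nwords, p);
        bad += check(buf, nwords, p);
    }
    free(buf);
    return bad;
}
```

To exercise paging rather than just memory, the size has to exceed
physical RAM, e.g. vm_pattern_test(100u * 1024 * 1024, 10) on a 64MB
machine. Any non-zero return (other than the allocation failure
value) means a word came back different from what was written.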

Thanks for your help,
Ian