For those who like Linux war stories, read on:
I decided to get out my debugger and get to the bottom of this one. Rationally
it would have been cheaper (with a 99% chance of success) just to order a new
Intel Pentium chip, but I wanted to learn about the Linux Kernel (and maybe pay
back an infinitesimal amount to the Linux world). After all we are replacing a
DM 12000,-- Solaris box with this thing, so a moral debt exists.
I am using the SuSE 4.2 (May 96) distribution of Linux; the kernel version is
2.0.0.
The machine is a so-called “Vobis Network Computer”, a very good buy at DM
880,--. I liked the housing (except there is barely enough room to insert new
memory), and was very impressed with the box till this happened. The mainboard
is a no-name PA-2002 (must have been built by Santa’s elves (the manual even has
an empty copyright, i.e. Copyright )), with a VIA Apollo Master chipset.
The CPU is an AMD 5k86 - p75 CPU, with the additional identifiers
AMD-SSA/5-75ABR, E 9619BPE, 1996 Malaysia.
The BIOS is variously identified as Award Modular BIOS v4.51PG at startup,
Version VBS1.11BP FIC PA2002 Rev1.21 (this latter might be a mainboard ID), and
in the ROM setup program as ROM ISA/ISA BIOS (2A5L9F09). Whew...
I tried adding more memory, turning off the swap file, rebuilding the kernel
without Pentium optimizations. No help. Sooner or later (usually after an hour
or so) a process would croak, usually (but not always) “named” (the BIND
name service daemon). The location was always the same though, EIP: 0010:0010eb66.
Looking at the dump (and playing with the good old DOS debug.exe (someone tell
me what I could have used under Linux?)), I entered the code and registers and
saw that the problem was that 0x26f80000000 (edx:eax) was being divided by 0xf54
(ecx), which of course causes an overflow. edx:eax was always the same, but ecx
varied from 0xf54 to 0x2676.
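The overflow condition is easy to check by hand. A 32-bit x86 div divides the
64-bit value in edx:eax by its operand and raises a divide error when the
quotient does not fit in 32 bits, which happens exactly when edx is greater than
or equal to the divisor. The following is only a minimal sketch in portable C
(the helper name div32_would_fault is made up, and nothing here is kernel code).
One way to read the dump is edx = 0x26f8 with eax = 0; that register split is an
assumption, but under it every divisor in the observed range 0xf54 to 0x2676 is
smaller than edx, so each of those divisions would fault.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if a 32-bit x86 "div" of edx:eax by divisor would raise a
 * divide error (divisor is zero, or the quotient needs more than 32 bits),
 * and 0 if the division would complete normally.
 */
static int div32_would_fault(uint32_t edx, uint32_t eax, uint32_t divisor)
{
    uint64_t dividend;

    if (divisor == 0)
        return 1;

    /* Rebuild the 64-bit dividend the CPU sees in edx:eax. */
    dividend = ((uint64_t)edx << 32) | eax;

    /* The quotient must fit in eax, i.e. in 32 bits. */
    return (dividend / divisor) > UINT32_MAX;
}
```

For example, div32_would_fault(0x26f8, 0, 0xf54) reports a fault; under the
same assumed split, the divisor would have to be larger than 0x26f8 for the
division to go through.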
Looking at the System.map (which I noticed lying around) I saw that it died in
“do_fast_gettimeoffset” in the source file “arch/i386/kernel/time.c”. There is
an old alternate routine (do_slow_gettimeoffset) that is used when CONFIG_APM
(Advanced Power Management) is defined, ostensibly because the “cycle counter is
not reliable”. This might shed light on the problem: although I have power
management turned off, it may be that my BIOS is in error and it is actually on.
On the other hand, maybe my board/chip combination also delivers a “non-reliable
cycle counter”. All this is wild speculation of course; the only thing I know
for sure is that do_fast_gettimeoffset is trying to do bozo divisions.
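The System.map lookup itself can be done mechanically rather than by eye. What
follows is a rough sketch, assuming the usual System.map layout of one
"hex-address type name" entry per line; the function name nearest_symbol is
made up for illustration. It returns the symbol with the highest address that
is still at or below the faulting EIP, which is the routine the crash happened
in.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 * Scans a System.map-style stream and copies into "best" the name of the
 * symbol with the highest address that is still <= addr.
 * Returns 1 if such a symbol was found, 0 otherwise.
 */
static int nearest_symbol(FILE *map, unsigned long addr,
                          char *best, size_t best_len)
{
    char line[256];
    char name[128];
    char type;
    unsigned long sym_addr;
    unsigned long best_addr = 0;
    int found = 0;

    if (best_len == 0)
        return 0;

    while (fgets(line, sizeof line, map) != NULL) {
        /* Skip lines that do not look like "address type name". */
        if (sscanf(line, "%lx %c %127s", &sym_addr, &type, name) != 3)
            continue;

        /* Keep the closest symbol that does not lie above addr. */
        if (sym_addr <= addr && (!found || sym_addr >= best_addr)) {
            best_addr = sym_addr;
            strncpy(best, name, best_len - 1);
            best[best_len - 1] = '\0';
            found = 1;
        }
    }
    return found;
}
```

Called with an open System.map and the offset part of the EIP from the dump
(0x0010eb66 here), it leaves the enclosing function's name in the buffer. It
makes no assumption that the file is sorted.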
Anyway I have defined CONFIG_APM and rebuilt the kernel and it seems to work
now. Some kernel expert who knows about those routines ought to have a look at
this problem. It might be particular to this board/chip combo, or it might be
something more generic to the AMD K5. Perhaps a guard would be in order here. I
am willing to try out kernel patches on this machine, at least a few, though it
is to be a production machine. There will certainly be a lot of these guys
(Vobis Network Computers) out there soon, since they are really cheap (in
Germany).
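To make the idea of a guard concrete, here is a sketch of what one might look
like. It is not the kernel's code and not a patch: the names tick_scale and
usec_offset, the constant USEC_PER_TICK and the fallback parameter are all
invented for illustration, and it assumes HZ = 100 and a scale factor computed
as (microseconds per tick * 2^32) / (cycles per tick). The point is only that
the divisor is checked before the division, and a caller-supplied fallback
value is used when the cycle counter looks implausible.

```c
#include <stdint.h>

#define USEC_PER_TICK 10000u   /* assumption: HZ = 100, so 10 ms per tick */

/*
 * Hypothetical helper, for illustration only.
 * Computes scale = (USEC_PER_TICK * 2^32) / cycles_per_tick.
 * Returns 0 without dividing if the quotient would not fit in 32 bits
 * (that is, if cycles_per_tick <= USEC_PER_TICK, which includes zero).
 */
static int tick_scale(uint32_t cycles_per_tick, uint32_t *scale)
{
    if (cycles_per_tick <= USEC_PER_TICK)
        return 0;                      /* the guard: refuse to divide */

    *scale = (uint32_t)(((uint64_t)USEC_PER_TICK << 32) / cycles_per_tick);
    return 1;
}

/*
 * Converts cycles elapsed since the last timer tick into microseconds.
 * Falls back to fallback_usec (e.g. from a slower, timer-chip based
 * method) when the cycle counter reading cannot be trusted.
 */
static uint32_t usec_offset(uint32_t cycles_since_tick,
                            uint32_t cycles_per_tick,
                            uint32_t fallback_usec)
{
    uint32_t scale;

    if (!tick_scale(cycles_per_tick, &scale))
        return fallback_usec;

    /* (cycles * scale) / 2^32 gives elapsed microseconds. */
    return (uint32_t)(((uint64_t)cycles_since_tick * scale) >> 32);
}
```

On a healthy 75 MHz part one would expect roughly 750000 cycles per tick, so
half a tick's worth of cycles comes out at about 5000 microseconds. With a
bogus divisor such as 0xf54 the function returns the fallback instead of
faulting.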
I hope that this will help some people,
Mike Wise