Debugging a kernel that fails to boot, was Re: Rookie 2.3.x compile problems & more & more

From: William Stearns (wstearns@pobox.com)
Date: Tue Apr 25 2000 - 22:26:05 EST


Good afternoon, Bill,
        (To the other denizens of linux-kernel; I'm CC'ing the list in the
hopes that the following ideas prove useful to others trying to figure out
why a given kernel fails to boot. One does _not_ necessarily need to be
a kernel guru or even programmer to make headway into troubleshooting a
problem like the following.)

On Tue, 25 Apr 2000, William Waddington wrote:

> Hello again Mr. Stearns,

        (Feel free to use Bill - it's such a distinguished name that
that's all the formality I need. :-)

> [insert profuse apologies for taking your time]
>
> With your kind help, I have managed to compile 2.3.99-pre3 and actually get my
> user-buffer-DMA device driver working.

        Marvelous!

> At least, it works fine on my home machine where I do my driver hacking (long recovery
> from knee surgery). It might work just fine on my work machine, where I do the serious
> driver and hardware testing, if only I could compile a kernel that would boot.
>
> On the work machine, 2.3.99-pre3 seems to compile OK, but when I try to boot it, it gets
> to:
>
> Uncompressing Linux......OK, booting the kernel
>
> and just hangs. It is wedged solid - takes a reset to reboot. This is on a 167MHz
> Pentium, running RH 6.0. On the chance that my compiler was stale, I did a full upgrade
> to RH 6.2 -- no joy, same problem. The home machine is a 450MHz P-III running RH 6.1.
>
> I guess I am bothering you again on the chance that this might be a build configuration
> issue, or if not, that it shows up somewhere else in your shrunken head collection.

        From that last comment, it's clear you must have read my code.
*smile*
        I hate to say it, but I don't know. Might I offer a few pointers
of what to check?

        - Take a look at the kernel image file in /boot. Is it larger
than 200K or smaller than 60K? The former tends to point to a successful
build; the latter almost certainly means that a file was created, but it
isn't a complete kernel.
        - Especially if the image is too small, read through the log file
buildkernel creates for each build; see /usr/src/configs. Are there any
errors in the build process?
        - Try building the kernel as a bzImage file (set
BKBUILDTYPE=bzImage in /etc/bkrc and rebuild). While you're going through
the build, quickly recheck the kernel configuration.
        - Do you have a similar machine that _does_ boot that kernel
successfully? If so, what are the hardware differences between them?
        - Try building the latest pre-release, currently
2.3.99-pre6-pre7. There's always a chance that your problem is a known
one that has been fixed.
        - Copy the kernel file to a floppy with:
dd if=/boot/bzImage-2.3.99-pre3 of=/dev/fd0
        If that floppy can successfully get further, that might point to a
problem with the boot loader, such as LILO or loadlin.
        - Check that you've picked the right processor (picking one that's
higher than what you've actually got in the system is likely to cause
problems, or at least cause it to run inefficiently). Try backing down to
a 386 and see if the problem goes away. If you enabled new features in
the kernel configuration right around the time the kernels stopped
booting, try removing those features and see if it works again.
        - Did you make any hardware changes around the time the kernel
stopped booting?
        - If it's a dual processor system (SMP), see if a UP (single
processor) kernel boots it correctly. That might point to a bug in the
SMP code.
        - Find out what the next line would be on a normal boot. That
might provide a clue as to what's having trouble.
        - Find the last successfully printed message in the kernel source
tree and see what happens after that's printed. Here's the brute force
search:

[root@sparrow /]# cd /usr/src/linux
[root@sparrow linux]# grep -i 'booting the kernel' */*.[chS] */*/*.[chS] */*/*/*.[chS] */*/*/*/*.[chS] */*/*/*/*/*.[chS]
arch/i386/boot/compressed/misc.c: puts("Ok, booting the kernel.\n");

        Looking in that file shows that this is the last message printed
in the decompress_kernel() routine. Now you take a look at what calls
decompress_kernel and see if there are any obvious problems in the calling
routine.

[root@sparrow linux]# grep -i 'decompress_kernel' */*.[chS] */*/*.[chS] */*/*/*.[chS] */*/*/*/*.[chS] */*/*/*/*/*.[chS]
arch/i386/boot/compressed/head.S: call SYMBOL_NAME(decompress_kernel)
arch/i386/boot/compressed/misc.c:int decompress_kernel(struct moveparams *mv)

        Now check head.S, and so on. Don't despair if you can't
understand the code; it may be that you can figure out what the problem is
by simply looking at the comments around the code in question.
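
        (If the pile of wildcards above gets unwieldy, or the tree is
deeper than five directory levels, something like the following sketch
should do the same job. The search_tree name is made up for illustration,
and it assumes a find that supports "-exec ... {} +"; older versions may
need to pipe into xargs instead.)

```shell
# Hypothetical helper: search every .c, .h and .S file under directory $1,
# at any depth, for the case-insensitive pattern $2.
search_tree() {
    # The extra /dev/null argument makes grep print file names even when
    # find hands it only a single file.
    find "$1" -type f -name '*.[chS]' -exec grep -i -- "$2" /dev/null {} +
}

# Example use:
#   search_tree /usr/src/linux 'booting the kernel'
#   search_tree /usr/src/linux 'decompress_kernel'
```
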
        - Obviously _some_ previous kernel was able to boot that machine.
What was the last one that worked? Now, what was the first kernel that
_failed_ to boot the machine correctly? Let's say those were 2.3.42 and
2.3.48, respectively, and you never built the ones in between. Pick a
kernel halfway between them, build it and see if it boots that machine
correctly. If it does, find a kernel halfway between it and the first
kernel that fails to boot; otherwise find a kernel halfway between it and
the last kernel that successfully booted. It might take a few builds, but
sooner or later you'll find a pair of successive kernels, perhaps even two
prepatch levels, where the first boots the machine and the second does
not.
        Look at what changed between those two kernels (you can even use
the 'diff' tool to see the difference between two patches; try 'diff -bud
patch1 patch2'). Ignore all the code that's not relevant to you (code
from other architectures, drivers for devices you don't have and don't
compile in, etc.). Pay particular attention to code early in the boot
process, especially changes in arch/i386/boot/compressed/misc.c and
head.S. What changes were made?
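
        (The halving procedure above is an ordinary binary search over
patch levels. Here's a rough sketch of the bookkeeping, assuming plain
integer patch levels within the 2.3 series as in the example; the
next_to_test name is made up, and prepatches would need their own
ordering.)

```shell
# Hypothetical helper: $1 is the patch level of the last kernel known to
# boot, $2 the patch level of the first kernel known to fail.
next_to_test() {
    good=$1   # e.g. 42 for 2.3.42
    bad=$2    # e.g. 48 for 2.3.48
    if [ $((bad - good)) -le 1 ]; then
        # Nothing left in between: this is the pair to compare.
        echo "done: 2.3.$good boots, 2.3.$bad does not"
        return 0
    fi
    # Otherwise build the kernel halfway between the two.
    echo "build and test 2.3.$(( (good + bad) / 2 ))"
}

next_to_test 42 48
```

        If 2.3.45 boots, the next call is "next_to_test 45 48"; if it
hangs, "next_to_test 42 45". Going from 2.3.42 to 2.3.48 this way takes
at most three builds.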

        Here's where the above work starts to pay off. Write up an email
to the linux-kernel mailing list (see the CC line if you've lost the
address). Using the format in /usr/src/linux/REPORTING-BUGS, share the
problem with the list. Include the results of all the above work you've
done, but make it as succinct and clear as possible. By this point you'll
be able to clearly state that somewhere between kernel 2.3.W-preX and
2.3.Y-preZ, the system stopped booting.
        If your report includes large log files or a kernel configuration,
you might consider placing those on a web or ftp server somewhere and
placing a pointer to them in your summary.

        It might seem like a long process, but keep in mind it's a
worthwhile contribution that will probably benefit others with the same
system or symptoms. By taking the time to track down the source of the
problem, you've freed up the main developers so they can focus on the
solution more quickly.

        Keep us posted. Here's hoping your knee heals quickly.
        Cheers,
        - Bill

---------------------------------------------------------------------------
        "``Threads are like salt. You like salt, I like salt, but we eat a
lot more pasta than salt.'' The thread guys are trying to tell you that
diet of salt is a good idea. They are wrong, don't listen, eat more
pasta and be happy."
        -- Larry McVoy <lm@bitmover.com>
--------------------------------------------------------------------------
William Stearns (wstearns@pobox.com). Mason, Buildkernel, named2hosts,
and ipfwadm2ipchains are at: http://www.pobox.com/~wstearns
LinuxMonth; articles for Linux Enthusiasts! http://www.linuxmonth.com
--------------------------------------------------------------------------




This archive was generated by hypermail 2b29 : Sun Apr 30 2000 - 21:00:10 EST