Re: [bug] SLOB crash, 2.6.24-rc2

From: Ingo Molnar
Date: Thu Nov 15 2007 - 06:29:19 EST



* Nick Piggin <nickpiggin@xxxxxxxxxxxx> wrote:

> On Thursday 15 November 2007 21:43, Ingo Molnar wrote:
> > * David Miller <davem@xxxxxxxxxxxxx> wrote:
> > > From: Matt Mackall <mpm@xxxxxxxxxxx>
> > > Date: Wed, 14 Nov 2007 17:37:13 -0600
> > >
> > > > No, the usual strategy for debugging problems -outside- SLOB is to
> > > > switch to another allocator with more extensive debugging facilities.
> > >
> > > Ok, so the thing we still can do is do a dump_stack() at the list
> > > debugging assertion trigger points.
> >
> > ok, i'll first try to trigger it again.
>
> I had implemented SLOB in userspace, so I resynched and think I found
> your problem. Sorry for the attachment format -- this mailer isn't the
> best. I'm really computer illiterate when it comes to userspace...

thx, i'll try your fix in a minute.

> Anyway, I'm really happy to see you're testing and using SLOB upstream
> :) Is there any particular reason that you're using it?

i sometimes test SLOB for -rt, but this time it's the result of my
"automated random QA" effort, as part of arch/x86 maintainance/QA.

the main trick is to build and booting random "make randconfig"
bzImages. That finds build bugs and a good deal of boot hang and crash
bugs as well. (it also found a compiler bug already) I can build and
boot about 1000 random kernels in 24 hours, and it's all fully
automated. I usually run it overnight - when a kernel does not come up
due to a bootup hang or crash (or the kernel log signals any exception
condition) then the script stops and i can fix it in the morning.

The first step towards this was to get allyesconfig bzImage kernels to
build and boot fine. That effort took months (we had many problems in
this area) - i think you saw bugreports and fixes from me about that on
lkml.

Once that worked reasonably well i made a small Kconfig patch that
forcibly selects a "minimum set" of drivers and kernel subsystems that
are needed to boot up a testsystem. Once a "make allnoconfig" and a
"make allyesconfig" bzImage kernel boots up fine on the testbox all
randconfig configs "inbetween" are supposed to build and boot fine as
well.

I also have a patch that adds all the x86 boot options like nosmp,
maxcpus=1, nohz=off, hpet=disable to be selectable as .config options -
so those boot options are randomized as well.

I also have a small patch that disables half a dozen drivers/features
that are not expected to work out of box in a bzImage kernel. (such as
ISA drivers that assume the presence of hardware, or root filesystem
features such as NFSROOT)

the resulting make randconfig kernel still has 99% of the degrees of
freedom that a stock make randconfig kernel has, so by all practical
purposes it's a fully random kernel - it just happens to boot on my
testsystem all the time.

A successful bootup means the test system is able to boot up into a
stock Fedora 8 userspace and is able to bring up its network interfaces
and ssh out (automatically) to the build box to signal the completion of
a successful test cycle. The logs are also analyzed for lockdep
assertions (if lockdep is enabled - which it is in about 20% of the
randconfig kernels) and other kernel bugs.

(just in case you were wondering about one of the reasons why the
arch/x86 unification merge went so smoothly, with nary a regression ;-)
Thomas is doing other types of automated QA of the x86 queue as well.)

this method found the SG-list corruption bugs the following night after
Linus committed Jen's SG-list changes, so it's pretty good at finding
regressions as early as possible.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/