Re: [bug] SLOB crash, 2.6.24-rc2

From: Dave Haywood
Date: Thu Nov 15 2007 - 07:40:30 EST


Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@xxxxxxxxxxxx> wrote:
>
>
>> On Thursday 15 November 2007 21:43, Ingo Molnar wrote:
>>
>>> * David Miller <davem@xxxxxxxxxxxxx> wrote:
>>>
>>>> From: Matt Mackall <mpm@xxxxxxxxxxx>
>>>> Date: Wed, 14 Nov 2007 17:37:13 -0600
>>>>
>>>>
>>>>> No, the usual strategy for debugging problems -outside- SLOB is to
>>>>> switch to another allocator with more extensive debugging facilities.
>>>>>
>>>> Ok, so the thing we still can do is do a dump_stack() at the list
>>>> debugging assertion trigger points.
>>>>
>>> ok, i'll first try to trigger it again.
>>>
>> I had implemented SLOB in userspace, so I resynched and think I found
>> your problem. Sorry for the attachment format -- this mailer isn't the
>> best. I'm really computer illiterate when it comes to userspace...
>>
>
> thx, i'll try your fix in a minute.
>
>
>> Anyway, I'm really happy to see you're testing and using SLOB upstream
>> :) Is there any particular reason that you're using it?
>>
>
> i sometimes test SLOB for -rt, but this time it's the result of my
> "automated random QA" effort, as part of arch/x86 maintainance/QA.
>
> the main trick is to build and booting random "make randconfig"
> bzImages. That finds build bugs and a good deal of boot hang and crash
> bugs as well. (it also found a compiler bug already) I can build and
> boot about 1000 random kernels in 24 hours, and it's all fully
> automated. I usually run it overnight - when a kernel does not come up
> due to a bootup hang or crash (or the kernel log signals any exception
> condition) then the script stops and i can fix it in the morning.
>
> The first step towards this was to get allyesconfig bzImage kernels to
> build and boot fine. That effort took months (we had many problems in
> this area) - i think you saw bugreports and fixes from me about that on
> lkml.
>
> Once that worked reasonably well i made a small Kconfig patch that
> forcibly selects a "minimum set" of drivers and kernel subsystems that
> are needed to boot up a testsystem. Once a "make allnoconfig" and a
> "make allyesconfig" bzImage kernel boots up fine on the testbox all
> randconfig configs "inbetween" are supposed to build and boot fine as
> well.
>
> I also have a patch that adds all the x86 boot options like nosmp,
> maxcpus=1, nohz=off, hpet=disable to be selectable as .config options -
> so those boot options are randomized as well.
>
> I also have a small patch that disables half a dozen drivers/features
> that are not expected to work out of box in a bzImage kernel. (such as
> ISA drivers that assume the presence of hardware, or root filesystem
> features such as NFSROOT)
>
> the resulting make randconfig kernel still has 99% of the degrees of
> freedom that a stock make randconfig kernel has, so by all practical
> purposes it's a fully random kernel - it just happens to boot on my
> testsystem all the time.
>
> A successful bootup means the test system is able to boot up into a
> stock Fedora 8 userspace and is able to bring up its network interfaces
> and ssh out (automatically) to the build box to signal the completion of
> a successful test cycle. The logs are also analyzed for lockdep
> assertions (if lockdep is enabled - which it is in about 20% of the
> randconfig kernels) and other kernel bugs.
>
> (just in case you were wondering about one of the reasons why the
> arch/x86 unification merge went so smoothly, with nary a regression ;-)
> Thomas is doing other types of automated QA of the x86 queue as well.)
>
> this method found the SG-list corruption bugs the following night after
> Linus committed Jen's SG-list changes, so it's pretty good at finding
> regressions as early as possible.
>
> Ingo
>

How complete is the QA testing? I was reading this interesting thread
and it occurred to me that this sounds like a useful distributed
computing application. ie a central server with all valid Kconfig
combinations (how many are there?) for a particular release (-rc or
otherwise) across all architectures. These are allocated to clients on
request to be built / booted etc. Any errors are fed back to the
central server. I guess this would be a useful resource for
developers. More importantly (and I don't know if this is the case
already!) a new Linux release (2.6.x) could be "certified" with some
level of testing on known hardware / architectures.

tbh, I feel sorry for Ingo's machine compiling 1000 random kernels in
24h! I'm surprised it hasn't called the Samaritans...

Dave.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/