Re: RFC: starting a kernel-testers group for newbies

From: Linus Torvalds
Date: Fri May 02 2008 - 12:29:11 EST




On Fri, 2 May 2008, Carlos R. Mafra wrote:
>
> So I would like to ask you what an user should do when facing what is
> probably a timing-related bug, as it appears I have the bad luck
> of hitting one.

Quite frankly, it will depend on the bug.

If it's *reliably* timing-related (which sounds crazy, but is not at all
unheard of), it can be reliably bisected down to some totally unrelated
commit that doesn't actually introduce the problem at all, but that
reliably turns it on or off.

That can be very misleading, and can cause us to basically revert a good
commit, only to not actually fix the bug (and possibly re-introduce the
bug that the reverted commit tried to fix).

But sometimes it gives us a clue where the timing problem is. But quite
frankly, that seems to be the exception rather than the rule.

There have been issues that literally seemed to depend on things like
cacheline placement etc, where changing config options for code that was
never actually even *run* would change timing just enough to show a bug
pseudo-reliably or not at all.

The good news is that those timing issues are really quite rare.

Tha bad news is that when they happen, they are almost totally
undebuggable.

> This same problem is still present with yesterday's git, and sometimes
> it hangs without hpet=disable and sometimes it doesn't. (And never
> with hpet=disable in the boot command line)

Hey, it may well be a HPET+NOHZ issue. But it could also be that HPET is
the thing that just allows you to see the hang.

> And using vga=6 or vga=0x0364 makes a difference in the probability
> of hanging.

.. and yeah, these kinds of really odd and obviously totally unrelated
issues are a sign of a bug that is either simply hardware instability or
very subtly timing-related.

The reason I mention hardware instability is that there really are bugs
that happen due to (for example) power supply instabilities. Brownouts
under heavy load have been causes of problems, but perhaps surprisingly,
so has _idle_ time thanks to sleep-states!

The latter is probably due to bad powr conditioning on the CPU power
lines, where the huge current swings (going at high CPU power to low, and
back again) not only have made soem motherboards "sing" (or "hum",
depending on frequency) but also causes voltage instability and then
the CPU crashes.

Am I saying that's the reason you see problems? Probably not. Most
instabilities really are due to kernel bugs. But hardware instabilities do
happen, and they can have these kinds of odd effects.

> I am just waiting -rc1 to be released to send an email with my
> problem again, as I am unable to debug this myself.
> I think this is ok from my part, right?

Yes. You've been a good bug reporter, and kept at it. It's not your fault
that the bug is hard to pin down.

Quite frankly, it does sound like the hang happens somewhere around the

hpet_init
hpet_acpi_add
hpet_resources
hpet_resources: 0xfed00000 is busy

printk's you added (correct?) and we've had tons of issues with NO_HZ, so
at a guess it is timer-related.

(And I assume it's stable if/once it gets past that boot hang issue? That
tends to mean that it's not some hardware instability, it's literally our
init code).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/