Re: RFC: starting a kernel-testers group for newbies

From: Mark Lord
Date: Fri May 09 2008 - 12:33:21 EST


Pallipadi, Venkatesh wrote:


-----Original Message-----
From: linux-kernel-owner@xxxxxxxxxxxxxxx [mailto:linux-kernel-owner@xxxxxxxxxxxxxxx] On Behalf Of Carlos R. Mafra
Sent: Friday, May 02, 2008 10:16 AM
To: Linus Torvalds
Cc: Adrian Bunk; Paul Mackerras; Josh Boyer; Arjan van de Ven; Andrew Morton; Rafael J. Wysocki; davem@xxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; jirislaby@xxxxxxxxx; Steven Rostedt; Pallipadi, Venkatesh
Subject: Re: RFC: starting a kernel-testers group for newbies

On Fri 2.May'08 at 9:28:08 -0700, Linus Torvalds wrote:

Quite frankly, it does sound like the hang happens somewhere
around the
hpet_init
hpet_acpi_add
hpet_resources
hpet_resources: 0xfed00000 is busy

printk's you added (correct?) and we've had tons of issues
with NO_HZ, so
at a guess it is timer-related.
It happens a bit before that because when it hangs it doesn't print the above lines, and when it does not hang these lines are
the ones right after the point where it hangs.

(And I assume it's stable if/once it gets past that boot hang issue?
Yes you are right. When I have luck and the boot succeeds my Sony laptop
is rock solid and the kernel is wonderful (even the card reader works!).

That
tends to mean that it's not some hardware instability, it's
literally our
init code).
A few days ago I found this message in lkml in reply to a hpet patch
http://lkml.org/lkml/2007/5/7/361 in which the reporter also had a similar hang, which was cured by hpet=disable.

So it is in my TODO list to try to check out if that patch is in the current -git and whether it can be reverted somehow (I added Venki to the Cc: now)

Thanks a lot for the answer!

It depends on whether we are HPET is being force detected based on the
chipset or whether it was exported by the BIOS in ACPI table.

If it was force enabled and above patch is having any effect, then you
should see a message like
Force enabled HPET at base address 0xfed00000

In any case, off late there seems to be quite a few breakages that are
related to HPET/timer interrupts. One of them was on a system which has
HPET being exported by BIOS
http://bugzilla.kernel.org/show_bug.cgi?id=10409
And the other one where we are force enabling based on chipset
http://bugzilla.kernel.org/show_bug.cgi?id=10561

And then we have hangs once in a while reports by you, Roman and Mark
here
http://bugzilla.kernel.org/show_bug.cgi?id=10377
http://bugzilla.kernel.org/show_bug.cgi?id=10117
..

Yeah. This particular bug first appeared when NOHZ & HPET were added.
Somebody once suggested it had something to do with an SMI interrupt
happening in the midst of HPET calibration or some such thing.

But nobody who works on the HPET code has ever shown more than a casual
interest in helping to track down and fix whatever the problem is.

Cheers
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/