Re: Simple script that locks up my box with recent kernels

From: Jesper Juhl
Date: Mon Oct 16 2006 - 18:46:10 EST


Ok, finally got to the end of the bisection (see below; quoting all of
my previous email since my concerns from that one are still valid).


On 09/10/06, Jesper Juhl <jesper.juhl@xxxxxxxxx> wrote:
Ok, some preliminary results on this before I go get some sleep + a
working day tomorrow...


On 07/10/06, Linus Torvalds <torvalds@xxxxxxxx> wrote:
>
>
> On Sat, 7 Oct 2006, Jesper Juhl wrote:
> >
> > > Can I bother you to just bisect it?
> >
> > Sure, but it will take a little while since building + booting +
> > starting the test + waiting for the lockup takes a fair bit of time
> > for each kernel
>
> Sure. That said, we've tried to narrow down things that took hours or days
> (under real loads, not some nice test-script) to reproduce, and while it
> doesn't always work, the real problem tends to be if the problem case
> isn't really reproducible. It sounds like yours is pretty clear-cut, and
> that will make things much easier.
>

Yeah, it seems pretty clear-cut, but I'm a bit nervous that it may
sometimes take longer than my observed 60min to reproduce, rendering
my git-bisection less than perfect (more on that below).


> > and also due to the fact that my git skills are pretty
> > limited, but I'll figure it out (need to improve those git skills
> > anyway) :-)
>
> "git bisect" in particular isn't that hard to use, and it will really do
> a lot of heavy lifting for you.
>
(...)
Thanks a lot for the tutorial, that really helped.

For some reason I couldn't get git to accept 2.6.17.13 as a "good"
starting point, so I used 2.6.17 instead, and the sha1 you gave me for
2.6.18-git15 as the "bad" starting point.

Here's where I am right now (a log of what I've done) :

[bisection start]

Bisecting: 5188 revisions left to test after this
[92164c5dd1ade33f4e90b72e407910de6694de49] USB: OHCI hub code unaligned access

[git bisect good]

Bisecting: 2567 revisions left to test after this
[e41542f5167d6b506607f8dd111fa0a3e468ccb8] [DCCP]: Introduce dccp_probe

[git bisect good]

Bisecting: 1351 revisions left to test after this
[b98adfccdf5f8dd34ae56a2d5adbe2c030bd4674] Merge
master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6

[git bisect good]

Bisecting: 635 revisions left to test after this
[538d9d532b0e0320c9dd326a560b5a72d73f910d] irq: remove a extra line

[git bisect good]

Bisecting: 292 revisions left to test after this
[db1a19b38f3a85f475b4ad716c71be133d8ca48e] Merge branch
'intelfb-patches' of
master.kernel.org:/pub/scm/linux/kernel/git/airlied/intelfb-2.6

[git bisect bad]

Bisecting: 146 revisions left to test after this
[1db27c11e9a0c6d659040ac0b7c64a339e248fa1] istallion: Remove private
baud rate decoding, which is also broken in this case on some
platforms

[git bisect bad]

Bisecting: 73 revisions left to test after this
[3171a0305d62e6627a24bff35af4f997e4988a80] simplify update_times
(avoid jiffies/jiffies_64 aliasing problem)

[git bisect good]

Bisecting: 37 revisions left to test after this
[29b884921634e1e01cbd276e1c9b8fc07a7e4a90] set EXIT_DEAD state in
do_exit(), not in schedule()

[currently testing this kernel]


Looking at "git bisect visualize" the current status is this :

bisect/good: 3171a0305d62e6627a24bff35af4f997e4988a80
bisect/bad: 1db27c11e9a0c6d659040ac0b7c64a339e248fa1
Current bisect marker at: 29b884921634e1e01cbd276e1c9b8fc07a7e4a90


I'm a little worried though that my results may not be completely reliable.

There's no doubt that you can trust the kernels that I told git were
"bad" since those resultet in a hang and there's just no getting
around that. So we know for a fact that the bad commit is somewhere
between my last found bad kernel and 2.6.17, what we don't know with
the same amount of certainty is if the bad commit is between my last
found good kernel and the last found bad one.

What I'm worried about is the kernels I've marked as "good". Before
starting this run I had never experienced a hang if the kernel
survived past the one hour mark, so I concluded that testing each
kernel for 80min would be enough to prove it good or bad. This now
seems to be not completely reliable since my second bad kernel
happened to hang after ~2hrs. This happened since I forgot to check my
computer after 80min and only came back to it some 3hrs later (I know
the time it hung since I had a xterm doing while true;do sleep
10;uptime;done running, so I could check.

This all means that my testing and concluding kernels were "good"
after 80min of test runtime may not be 100% reliable.

Is it useful for me to continue bisecting from the point I'm at, or
should I reset from good==2.6.17 and bad==the_last_bad_commit_I_found
? Or do you have a likely culprit I should try revoking?

Whatever your answer it'll have to wait until tomorrow evening since
I'm going to go get some sleep now, but please let me know what you'd
like me to do ...


In the end, this is what git told me :

1db27c11e9a0c6d659040ac0b7c64a339e248fa1 is first bad commit
commit 1db27c11e9a0c6d659040ac0b7c64a339e248fa1
Author: Alan Cox <alan@xxxxxxxxxxxxxxxxxxx>
Date: Fri Sep 29 02:01:38 2006 -0700

[PATCH] istallion: Remove private baud rate decoding, which is
also broken in this case on some platforms

Signed-off-by: Alan Cox <alan@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxx>
Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>

:040000 040000 0fc700de5e78b39acc130d529cf59437e9242b68
884b27574b6c38a5fa952d09ca945b167e36db84 M drivers


But, that doesn't make much sense, so I very strongly suspect that my
test case was not as reliable as I thought.
We can trust the commits I marked as 'bad' though since there's no
getting around a complete lockup of the box. So we know for sure now
that things broke between 2.6.17 and the commit above. But since that
commit makes no sense as the cause of the breakage it must be a case
of me having marked a kernel as 'good' that would eventually have
turned out bad if I'd run it longer :-(

Where do I go from here? The problem is still there... I'll test
2.6.19-rc2 tomorrow, but apart from that I don't know how to proceed
apart from trying to capture a sysrq+t dump when the box locks up...
any ideas?


--
Jesper Juhl <jesper.juhl@xxxxxxxxx>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/