Re: system gets stuck in a lock during boot

From: Justin Mattock
Date: Sun Oct 04 2009 - 20:12:28 EST


On Sun, Oct 4, 2009 at 10:41 AM, Ingo Molnar <mingo@xxxxxxx> wrote:
>
> * Jason Baron <jbaron@xxxxxxxxxx> wrote:
>
>> On Mon, Sep 07, 2009 at 02:49:44PM -0700, Justin Mattock wrote:
>> > >>
>> > >> * Justin P. Mattock<justinmattock@xxxxxxxxx>  wrote:
>> > >>
>> > >>
>> > >>>
>> > >>> Ingo Molnar wrote:
>> > >>>
>> > >>>>
>> > >>>> * Justin Mattock<justinmattock@xxxxxxxxx>   wrote:
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>>
>> > >>>>> O.K. I feel better, deleted
>> > >>>>> my system, and threw in a minimal built system
>> > >>>>> with only the bare essentials to boot.
>> > >>>>> (just to make sure things are correct).
>> > >>>>>
>> > >>>>> unfortunately after building rc6 I'm still hitting
>> > >>>>> this. really am not sure why this is happening.
>> > >>>>>
>> > >>>>>
>> > >>>>
>> > >>>> Could you please double-check the bisection result by doing this:
>> > >>>>
>> > >>>>   git revert af6af30c0f
>> > >>>>
>> > >>>> on the latest kernel and seeing whether that fixes the lockup?
>> > >>>>
>> > >>>> Bisections are very efficient and hence very sensitive as well to
>> > >>>> minimal errors. Just one small mistake near the end of a bisection
>> > >>>> can blame the wrong commit.
>> > >>>>
>> > >>>> So the best way to double-check such 100%-triggerable crashes is to
>> > >>>> do the revert. I tried the revert and it can be done fine here.
>> > >>>>
>> > >>>> [ _If_ that does not fix the bug then to save time you can
>> > >>>>     'backtrack' the bisection, instead of re-doing it completely.
>> > >>>>     I.e. you have your bisection log, re-check the final steps going
>> > >>>>     backwards. Once you find a discrepancy (i.e. a 'bad' point that
>> > >>>>     is 'good' or the other way around), redo the bisection log
>> > >>>>     commands up to that point and continue it up to the end. ]
>> > >>>>
>> > >>>>        Ingo
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>
>> > >>> Shoot, I did not see your post here. As for my bisect log,
>> > >>> I guess it gets cleared after a git bisect reset?
>> > >>>
>> > >>> Anyway, after git bisect had finished I looked manually at the
>> > >>> commits it had pointed to: the one I sent in a previous post,
>> > >>> and this one:
>> > >>>
>> > >>>  9424edc2da097c8589fcc24a72552d33e54be161
>> > >>>
>> > >>
>> > >> (this commit has no effect on your kernel image, at all.)
>> > >>
>> > >>
>> > >
>> > > yep. but it was worth a try.
>> > >>>
>> > >>> At the time, looking at the commit, I took this one to be the more
>> > >>> likely cause since it is related to ELF and so forth, but reverting
>> > >>> it on rc6 made no difference. (Reverting the previous commit
>> > >>> fixes this for me, on a regular tarball as well as in git.)
>> > >>>
>> > >>> At this point, since this system is a fresh from-scratch build,
>> > >>> I think I might be doing something wrong (all the CFLAGS and
>> > >>> such are in a previous post).
>> > >>>
>> > >>> At the moment I don't have a problem applying a patch to the
>> > >>> kernel for this, especially since I'm the only one who seems to
>> > >>> be hitting it. If more reports of this come in, we can go from
>> > >>> there.
>> > >>>
>> > >>
>> > >> What would be nice is to verify your bisection end result, i.e. do
>> > >> what i suggested:
>> > >>
>> > >>
>> > >
>> > > Yeah, I've done this on both kernels (three, to be exact), and all
>> > > boot after reverting
>> > > Fix perf-tracepoint OOPS.
>> > >
>> > > As for my system, I'm still convinced that I might be doing something
>> > > wrong over here.
>> > >
>> > >>>> Could you please double-check the bisection result by doing this:
>> > >>>>
>> > >>>>   git revert af6af30c0f
>> > >>>>
>> > >>>> on the latest kernel and seeing whether that fixes the lockup?
>> > >>>>
>> > >>
>> > >> if this doesn't fix it on latest -git then this commit is not the
>> > >> cause of the lockup.
>> > >>
>> > >>        Ingo
>> > >>
>> > >>
>> > >
>> > > Reverting this commit (Fix perf-tracepoint OOPS.) does fix my stuckage,
>> > > but I'm left, like others, asking why.
>> > > In any case I still think I'm setting something wrong with gcc, or that
>> > > something in userland might be causing this.
>> > >
>> > > Justin P. Mattock
>> > >
>> >
>> > O.k. here's something awkward about this issue I was
>> > experiencing. At the moment I have two iMacs;
>> > here are the descriptions:
>> >
>> > imac A) the one with the problem
>> >
>> > OS: built from the clfs book
>> > x86_64 multilib with only lib64
>> >
>> > built everything with these flags:
>> > CFLAGS="-m64 -mtune=core2 -march=core2
>> > -mfpmath=both -O2 -pipe -fomit-frame-pointer
>> > -fstack-protection"
>> > CXXFLAGS="${CFLAGS}" MAKEOPTS="{-j3}"
>> > while compiling everything with
>> > gcc version: 4.5.0 20090730
>> >
>> >
>> > imac B) the one that works
>> >
>> > OS: clfs(just built a few days ago)
>> > x86_64 pure64 bit build
>> > (lib with a symlink to lib64)
>> > CFLAGS="-m64 -mtune=core2 -march=core2
>> >  -O2 -pipe -fomit-frame-pointer"
>> > CXXFLAGS="${CFLAGS}" MAKEOPTS="{-j3}"
>> > gcc version: 4.4.1 (GCC for Cross-LFS 4.4.1.20090722)
>> >
>> > The only things I can think of are that I hit something
>> > because of gcc, that something goes wrong with the libraries,
>> > or that something is happening with either the -mfpmath=both
>> > or the stack-protection option.
>> >
>> > At this point, since the kernel seems to be running fine,
>> > the plan is to just trash the system that has this issue and
>> > leave it at: I was hitting some weird anomaly.
>> >
>>
>> hi Justin,
>>
>> I've been playing around with gcc '4.5' as well and hit a panic that
>> looks very similar to what you've seen with stock 2.6.31 - I haven't
>> seen it anywhere else. Anyways, it seems to be some sort of alignment
>> issue with the 'struct ftrace_event_call'. I'm not sure yet if this is a
>> compiler or kernel issue. But the following kernel patch fixes the issue
>> for me. It would be interesting to verify if the patch also resolves the
>> issue for you.
>
> Would be nice to know precisely what kind of problem is being hit here -
> we'd like to fix either the kernel or GCC - depending on where the bug
> lies.
>
>        Ingo
>

So I wasn't going crazy....
Anyway, I still have that system (clfs).
I can go ahead and put it back on the machine
and see whether I hit this again (keep in mind
I just got back from a 7hr drive, so it might be
tomorrow).
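
On my earlier question about the bisect log: as far as I can tell,
git bisect reset does discard it, so it has to be saved first if the
bisection is going to be backtracked the way Ingo described. Below is
a rough sketch of that, not my actual kernel bisection. It runs in a
throwaway repository; the repository, the "version" file, the eight
commits and the choice of commit 5 as the bad one are all made up for
illustration, and the one-line test stands in for "build, boot, see
whether it locks up".

```shell
#!/bin/sh
# Sketch only: the repo, the file name and the "bad" commit are invented.
set -e

repo=$(mktemp -d)
log=$(mktemp)
cd "$repo"
git init -q .
git config user.name "demo"
git config user.email "demo@example.invalid"

# Eight commits; pretend the bug arrives with commit 5.
n=1
while [ "$n" -le 8 ]; do
    echo "$n" > version
    git add version
    git commit -q -m "commit $n"
    n=$((n + 1))
done

# HEAD is known bad, HEAD~7 is known good.
# The test command exits 0 for "good" and 1 for "bad".
git bisect start HEAD HEAD~7 >/dev/null
git bisect run sh -c 'test "$(cat version)" -lt 5' >/dev/null
first_bad=$(git rev-parse refs/bisect/bad)

# Save the log BEFORE resetting; the reset throws the bisect state away.
git bisect log > "$log"
git bisect reset >/dev/null 2>&1

if git bisect log >/dev/null 2>&1; then
    echo "log survived the reset"
else
    echo "log is gone after the reset"
fi

# Replay the saved log and check that it lands on the same commit.
git bisect replay "$log" >/dev/null 2>&1
replayed=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

echo "first bad: $(git show -s --format=%s "$first_bad")"
if [ "$first_bad" = "$replayed" ]; then
    echo "replay agrees with the original bisection"
fi
```

To backtrack instead of replaying everything, the saved log can be
edited down to the last step that is still trusted before handing it
to git bisect replay, and the bisection continued by hand from there.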

--
Justin P. Mattock