A couple of server boxes I use for testing kept hanging. So in the end
I put a keyboard/mouse on the hanging boxes and bashed alt-scrolllock and
co a lot once it stopped.
> And in this case, we actually _do_ have a wait-queue we're waiting on, so
> this isn't even one of those "wait a while" cases - we're actually really
> going to sleep. So this one is a red herring.
Ok. Yes because we must either wait or succeed.
> > 2. __wait_on_page says you are supposed to own the page before waiting
> > on it (ie up page->count). I don't see where we do that. We even
> > go back to the start to allow for the pages on the inode changing
> > if we sleep but we don't up the count ourselves.
>
> This may be the real offender - I think you found a real bug, and it looks
> extremely hard to trigger, but it looks like it has been there from the
> very first version.
With 2.2.7 I've never seen it, with 2.2.9 and 2.2.9ac1 I see a hang every 24-36
hours. Both my test boxes hung on the same thing.
> atomic_inc(&page->count);
> ..
> atomic_dec(&page->count);
>
> around that part should do the trick. What is you test load?
On the one box I was running a pile of assorted network and disk hogs
- repeatedly tarring the fs to /dev/null, find / etc
On the other it was basically an idle 6.0 box.
I'll rebuild with this changed and run some more tests
Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/