Re: VM-related Oops: 2.4.15pre1

From: Simon Kirby (sim@netnation.com)
Date: Mon Nov 19 2001 - 13:31:56 EST


On Mon, Nov 19, 2001 at 10:03:34AM -0800, Linus Torvalds wrote:

> I suspect that your earlier oopses left something in a stale state - this
> is the same machine that you've reported other oopses for, no?

Yes, I think I have reported one Oops for this machine in the past, and
I believe it was explained by earlier kernel bugs (it was running 2.4.12).
On this kernel version, we've only seen the single BUG() message
regarding page->mapping, and the associated forced Oops/backtrace thing.
Every BUG() and backtrace has been the same except for a few registers,
including the first backtrace.

> It looks like it's a bog-standard page, that was just free'd (probably
> because of page->count corruption) while it was still in the page cache.
> Now, how that page->count corruption actually happened, I have no idea,
> which is why I suspect you had other earlier oopses that left the machine
> in an inconsistent state.
>
> There _is_ a known race in 2.4.15pre1, where we simply test a condition
> that isn't true any more and that can cause spurious oopses (not this one,
> though) under the right circumstances. Such an oops might have left
> something in the VM in a half-way state...

Right, but there have been no other Oopses on this machine since
2.4.15pre1 was installed (it has been up 5 days). Previously, with 2.4.12,
it Oopsed in some memory freeing function (I think it was __free_pages_ok
or something), but I didn't have a serial console on it at the time and
the machine had locked up.

> Can you reproduce this on pre6, for example? And if so, what's the load?

Will pre6 eat my filesystem? :) It's a production box (I just used pre1
because I read through it and saw there were no serious changes, just
a simple race fix and a few other things).

The box is a heavily-hit shared hosting web server running the usual
collection of Apache, Perl, PHP, and other programs.

I wonder if the quota stuff (which is also used heavily on this box, but
probably not tested anywhere near as widely elsewhere) is the culprit.
Jan Kara has sent me this patch to test (but I have not yet had the
chance to try it on some production servers). It looks like he's wrapped
some memory freeing functions with lock_kernel. Currently, 2.4.14 Oopses
all over the place if quotacheck is run on an active filesystem with
quota turned on (which is broken to begin with, yes, but shouldn't cause
Oopses).

Attached is Jan's patch. See anything interesting? He said he has not
yet submitted it because he hasn't had a chance to test it on an SMP box.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ sim@stormix.com ][ sim@netnation.com ]
[ Opinions expressed are not necessarily those of my employers. ]


