Re: BUG() in 2.6.28-rc8-git2 under heavy load

From: Andi Kleen
Date: Mon Dec 22 2008 - 19:18:25 EST


> 1. The CPU reported the wrong faulting instruction (seems highly

I remember spending quite some time on a report a few years ago
and in the end decided the CPU in that case was reporting incorrect
fault addresses too. iirc we blamed it on overheating or some
unspecified hardware damage.

> unlikely, since that means it wouldn't be able to resume properly in
> other situations),
> 2. We really were executing at a slightly strange (offset) EIP
>
> I'm going for #2. But how could it happen? Did the caller supply a
> wrong address in its CALL? It seems unlikely. Why would it happen only
> for this function, four times in a row, at the exact same location?
> Was the caller's code corrupted?

There are a couple of ways this can happen: someone corrupted a pointer
on the stack, or one in a structure containing function pointers.
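
To make the second case concrete, here is a minimal userland sketch, not kernel code. All names (demo_ops, demo_stray_write, and so on) are hypothetical and chosen only for illustration. It simulates a stray write that runs past a buffer into an adjacent function pointer, and it only detects the damage rather than calling through the bad pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ops structure: a small buffer followed by a function pointer. */
struct demo_ops {
    char name[8];
    int (*handler)(int);
};

static int real_handler(int x)
{
    return x + 1;
}

/* Start from a known-good state. */
void demo_init(struct demo_ops *ops)
{
    assert(ops != NULL);
    memset(ops, 0, sizeof(*ops));
    ops->handler = real_handler;
}

/*
 * Simulate a stray write that begins at the start of the structure
 * (where name lives). Any length beyond sizeof(name) spills into the
 * members that follow, including the function pointer. The length is
 * clamped to the structure size so the demo itself stays well defined.
 */
void demo_stray_write(struct demo_ops *ops, const void *src, size_t len)
{
    if (len > sizeof(*ops))
        len = sizeof(*ops);
    memcpy((unsigned char *)ops, src, len);
}

/*
 * Return 1 if the stored pointer still matches real_handler byte for
 * byte, 0 otherwise. The bytes are compared instead of calling the
 * pointer, because calling a corrupted pointer is exactly the
 * "jump to a strange address" failure being discussed.
 */
int demo_handler_intact(const struct demo_ops *ops)
{
    int (*expected)(int) = real_handler;

    return memcmp(&ops->handler, &expected, sizeof(expected)) == 0;
}
```

A write of 8 bytes stays inside name and leaves the handler alone; a write of sizeof(struct demo_ops) bytes replaces the handler, and the next indirect call through it would land wherever those bytes happen to point.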

On x86-64 there's another trap: if you call a function that is
declared varargs (with ...) through a prototype that doesn't
contain the ..., it can also jump to random addresses because of
the way gcc handles varargs. Normally we have very few varargs
functions in the kernel, so it's unlikely, but I've seen
the problem in userland.
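
A small sketch of the prototype mismatch, with a hypothetical function sum_ints used only for illustration. As background, and as far as I know: the x86-64 System V ABI has the caller of a variadic function pass an upper bound on the number of vector registers used in %al, and older gcc prologues used that value to compute a jump into the code that saves those registers. A caller compiled against a prototype without ... never sets %al, so the computed jump can go anywhere. The broken declaration is shown only in a comment, since calling through it is undefined behaviour.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical variadic function: sums 'count' int arguments. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    assert(count >= 0);
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

/*
 * WRONG, shown as a comment only. Imagine this in another source file
 * that does not include the real declaration:
 *
 *     int sum_ints(int count, int a, int b);    // no "..."
 *     sum_ints(2, 1, 2);
 *
 * The compiler treats the callee as an ordinary function, so on x86-64
 * it has no reason to set %al before the call. The variadic prologue
 * then works from whatever happens to be in that register.
 */

/* RIGHT: the call is made with the variadic prototype in scope. */
int call_with_correct_prototype(void)
{
    return sum_ints(3, 10, 20, 12);
}
```

Building with -Wmissing-prototypes and keeping declarations in a shared header is the usual way to keep the two sides from drifting apart.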

If it's reproducible, one way to track it down would be to enable
LBR (last branch recording; I have some old patches for that which
could be adapted), but that would only tell you the caller.

-Andi
--
ak@xxxxxxxxxxxxxxx