Re: Oops in sched.c on PPro SMP

From: Peter Waechtler (pwaechtler@mac.com)
Date: Tue Sep 17 2002 - 12:18:19 EST


On Tuesday, 17 September 2002, at 01:13, Andrea Arcangeli wrote:

> On Mon, Sep 16, 2002 at 11:16:20PM +0200, Peter Waechtler wrote:
>> On Monday, 16 September 2002, at 17:44, Andrea Arcangeli wrote:
>>
>>> On Mon, Sep 16, 2002 at 03:49:27PM +0100, Alan Cox wrote:
>>>> Also does turning off the nmi watchdog junk make the box stable ?
>>>
>>> good idea, I didn't think about this one since I've only heard of the
>>> nmi locking up boxes hard after hours of load, never of it generating
>>> any malfunction, but the nmi handling certainly isn't one of the
>>> most exercised hardware paths in the cpus, so it's a good idea to
>>> reproduce with it turned off (OTOH I guess you probably turned it on
>>> explicitly only after you got these troubles, in order to debug them).
>>>
>>
>> I only turned the nmi watchdog on for the one "unknown" version Oops.
>>
>> This box was running fine with 2.4.18-SuSE with uptimes of 40+ days.
>> _Now_ I am almost sure that it's _not_ a hardware problem (FENCE
>> counting here as software, since there is a software workaround).
>>
>> I had 3 lockups in 2 days when I switched to 2.4.19, even with a lower
>> room temperature. No, there _must_ be a bug :)
>
> possible. Which was the previous kernel running in the machine before
> 2.4.18-SuSE?
>

I guess 2.4.10-SuSE from 7.3.
In January I switched to 2.4.14 and 2.4.17 and applied the XFS patches.
At exactly the same instruction: kaboom!

http://marc.theaimsgroup.com/?l=linux-kernel&m=101113532211430&w=2

>> Can someone explain to me the difference between labels 1 and 2?
>> Why is the "js 2f" there? I don't fully understand this -
>> it looks broken to me.
>
> it's correct, otherwise we would have noticed long ago ;)
>
> What it does is subtract 1 from the lock; if the result goes negative
> (signed) it jumps into the looping slow path (label 2), and when it
> finally stops looping because it acquired the lock, it jumps back to 1
> and enters the critical section. The slow path takes care of acquiring
> the lock internally: it first polls without requiring the cacheline
> exclusive, then does the trylock again.

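After studying the disassembly I now see the "trick": the slow path at
label 2 is emitted into a separate section, so the "js 2f" jumps out of
line and the uncontended fast path falls straight through to label 1.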

>>
>> include/asm-i386/rwlock.h
>>
>> #define __build_read_lock_ptr(rw, helper) \
>>         asm volatile(LOCK "subl $1,(%0)\n\t" \
>>                      "js 2f\n" \
>>                      "1:\n" \
>>                      LOCK_SECTION_START("") \
>>                      "2:\tcall " helper "\n\t" \
>>                      "jmp 1b\n" \
>>                      LOCK_SECTION_END \
>>                      ::"a" (rw) : "memory")
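
To check my understanding, here is a rough user-space model in C11
atomics of what the macro and its helper seem to do. It is only a
sketch, not the kernel code: all model_* names are made up, the helper
body is not shown in this thread so the slow path is my assumption based
on Andrea's description, and I assume the unlocked value RW_LOCK_BIAS is
0x01000000 with a writer taking the whole bias.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed unlocked value; a writer subtracts the whole bias. */
#define RW_LOCK_BIAS 0x01000000

/* Hypothetical stand-in for the kernel's rwlock_t. */
typedef struct { atomic_int lock; } model_rwlock_t;

/* Slow path (label 2): models what the helper is assumed to do. */
static inline void model_read_lock_failed(model_rwlock_t *rw)
{
    for (;;) {
        atomic_fetch_add(&rw->lock, 1);          /* undo our decrement */
        /* Poll with plain loads, so the cacheline can stay shared. */
        while (atomic_load_explicit(&rw->lock, memory_order_relaxed) < 1)
            ;
        if (atomic_fetch_sub(&rw->lock, 1) - 1 >= 0)
            return;                              /* trylock succeeded */
    }
}

/* Fast path: corresponds to "subl $1,(%0); js 2f; 1:". */
static inline void model_read_lock(model_rwlock_t *rw)
{
    if (atomic_fetch_sub(&rw->lock, 1) - 1 < 0)  /* sign bit set: "js 2f" */
        model_read_lock_failed(rw);              /* "call helper; jmp 1b" */
    /* label 1: the critical section starts here */
}

static inline void model_read_unlock(model_rwlock_t *rw)
{
    int old = atomic_fetch_add(&rw->lock, 1);
    assert(old < RW_LOCK_BIAS);                  /* we must have held it */
    (void)old;
}

/* Non-blocking variants, only there to make the model easy to poke at. */
static inline int model_read_trylock(model_rwlock_t *rw)
{
    if (atomic_fetch_sub(&rw->lock, 1) - 1 >= 0)
        return 1;
    atomic_fetch_add(&rw->lock, 1);              /* writer inside: back off */
    return 0;
}

static inline int model_write_trylock(model_rwlock_t *rw)
{
    if (atomic_fetch_sub(&rw->lock, RW_LOCK_BIAS) == RW_LOCK_BIAS)
        return 1;                                /* no readers, no writer */
    atomic_fetch_add(&rw->lock, RW_LOCK_BIAS);   /* someone inside: back off */
    return 0;
}
```

In this model a reader's decrement only goes negative while a writer
holds the lock, which is why the sign flag alone is enough for the
"js 2f" test. Labels 1 and 2 are then simply "lock acquired, continue"
and "out-of-line retry loop".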



