Re: Linux 2.2.14 Sparc-32 SS-10 SMP spin_lock hang...

From: Anton Blanchard (anton@progsoc.uts.edu.au)
Date: Sat Jan 08 2000 - 22:28:06 EST


 
> The meantime between crashes with 2.2.13 on the SS-10 configured as above
> running Apache 1.3.3p3 was about four hours, with 2.2.14 it's about a day and a
> half. Here is the crash data for the 2.2.14 crash, on the console we get:
>
> spin_lock_irqsave(f808e66c) CPU #2 stuck at f00372c, owner PC(f003672c):CPU(2)
> spin_lock(f014b9ac) CPU #0 stuck at f003bde0, owner PC(f0036538):CPU(2)
> spin_lock(f014b9ac) CPU #1 stuck at f003bde0, owner PC(f0036538):CPU(2)
> spin_lock(f014b9ac) CPU #3 stuck at f003bde0, owner PC(f0036538):CPU(2)

CPU #0, #1 and #3 are stuck because CPU #2 grabbed the kernel_flag in
do_exit so we can ignore them.

The interesting line is:

spin_lock_irqsave(f808e66c) CPU #2 stuck at f00372c, owner PC(f003672c):CPU(2)

(I assume f00372c should be f003672c and is just written down incorrectly).

>From a quick guess, f808e66c points to a sigmask_lock and CPU #2 is stuck
in __exit_sighand (called from do_exit):

static inline void __exit_sighand(struct task_struct *tsk)
{
        struct signal_struct * sig = tsk->sig;
        unsigned long flags;

        spin_lock_irqsave(&tsk->sigmask_lock, flags);
        if (sig) {
                tsk->sig = NULL;
                if (atomic_dec_and_test(&sig->count))
                        kfree(sig);
        }

        flush_signals(tsk);
        spin_unlock_irqrestore(&tsk->sigmask_lock, flags);
}

The debug line says that CPU #2 has already obtained the sigmask_lock and is
trying to again and in the same spot! Sounds suspicious, I'll give the
sparc spinlock code a once over.

Anton

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Jan 15 2000 - 21:00:13 EST