Re: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in the debug code

From: Sergey Senozhatsky
Date: Thu Jan 28 2016 - 05:40:34 EST


On (01/28/16 17:13), Byungchul Park wrote:
[..]
> > > > int down_trylock(struct semaphore *sem)
> > > > {
> > > > unsigned long flags;
> > > > int count;
> > > >
> > > > raw_spin_lock_irqsave(&sem->lock, flags); <<<<<< um...
> > >
> > > I also think it's hard, but a backtrace said the lockup happened here.
> >
> > what was the state of `struct semaphore *sem' and especially of `sem->lock'?
> > what was the lock->owner_cpu doing? (addr2line of its pc registe, for example).
>
> Unfortunately, it's not reproduced anymore.
>
> If it's clearly a spinlock caller's bug as you said, modifying the
> spinlock debug code does not help it at all. But I found there's a
> possiblity in the debug code *itself* to cause a lockup. So I tried
> to fix it. What do you think about it?

ah... silly me... you mean the first CPU that triggers the spin_dump() will
deadlock itself, so the rest of CPUs will see endless recursive
spin_lock()->spin_dump()->spin_lock()->spin_dump() calls?

like the one below?


CPUZ is doing vprintk_emit()->spin_lock(), CPUA is the spin_lock's owner

CPUZ -> vprintk_emit()
__spin_lock_debug()
for (i = 0; i < `loops_per_jiffy * HZ'; i++) { << wait for the lock
if (arch_spin_trylock())
return;
__delay(1);
}
spin_dump() << lock is still owned by CPUA
{ -> vprintk_emit()
__spin_lock_debug()
for (...) {
if (arch_spin_trylock())
return;
__delay(1);
}
<< CPUA unlocked the lock
spin_dump()
{ -> vprintk_emit()
__spin_lock_debug()
for (...) {
if (arch_spin_trylock()) << success!!
/* CPUZ now owns the lock */
return;
}
}

<< we return here with the spin_lock being owned by this CPUZ

trigger_all_cpu_backtrace()

<< and... now it does the arch_spin_lock()
/*
* The trylock above was causing a livelock. Give the lower level arch
* specific lock code a chance to acquire the lock. We have already
* printed a warning/backtrace at this point. The non-debug arch
* specific code might actually succeed in acquiring the lock. If it is
* not successful, the end-result is the same - there is no forward
* progress.
*/
arch_spin_lock(&lock->raw_lock);

<< which obviously dealocks this CPU...
}

trigger_all_cpu_backtrace()

arch_spin_lock()




so
"the CPUZ is now keeping the lock forever, and not going to release it"
and
"CPUA-CPUX will do vprintk_emit()->spin_lock()->spin_dump()->vprintk_emit()->..."



My apologies for not getting it right the first time. Sorry!

Can you please update your bug description in the commit message?
It's the deadlock that is causing the recursion on other CPUs in the
first place.


-ss