RE: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in the debug code

From: byungchul.park
Date: Wed Jan 27 2016 - 23:52:22 EST


> From: Sergey Senozhatsky [mailto:sergey.senozhatsky.work@xxxxxxxxx]
> Sent: Thursday, January 28, 2016 11:38 AM
> To: Byungchul Park
> Cc: akpm@xxxxxxxxxxxxxxxxxxxx; mingo@xxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; akinobu.mita@xxxxxxxxx; jack@xxxxxxx;
> torvalds@xxxxxxxxxxxxxxxxxxxx; peter@xxxxxxxxxxxxxxxxxx;
> sergey.senozhatsky.work@xxxxxxxxx; sergey.senozhatsky@xxxxxxxxx
> Subject: Re: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in
> the debug code
>
> ok, I'll repeat the questions.
>
> under what circumstances you hit this problem? ...memory corruption, cpu
> core has been powered off, while it owned the spin_lock... your irqsave
> didn't work?

I think these are not the my case.

>
> the thing is, it's really-really hard to lockup in console_trylock()...
>
> int down_trylock(struct semaphore *sem)
> {
> unsigned long flags;
> int count;
>
> raw_spin_lock_irqsave(&sem->lock, flags); <<<<<< um...

I also think it's hard, but a backtrace said the lockup happened here.

> count = sem->count - 1;
> if (likely(count >= 0))
> sem->count = count;
> raw_spin_unlock_irqrestore(&sem->lock, flags);
>
> return (count < 0);
> }
>
> was not able to dereference sem->count? `*sem' was corrupted? or because
> sem->lock was corrupted? in any of those cases you solve the problem from
> the wrong side. if you have a memory corruption then it can corrupt

What I solved here is to make it possible to print what the problem is, by
the spinlock debug code instead of system lockup while printing a debug
message. I think it's not wrong.

> anything,
> including, for example, console driver spin_lock.
>
>
> suppose, that you hit do_raw_spin_lock()->spin_dump(), which means that
> the
> spin_lock was not corrupted, it passed debug_spin_lock_before() after all.
> and that spin_lock was taken for longer than `loops_per_jiffy * HZ', while
> other CPU was doing
>
> count = sem->count - 1;
> if (likely(count >= 0))
> sem->count = count;
>
> ???
>
> was the CPU that owned the lock alive? (h/w fault, perhaps?).

I am just curious.. Is it impossible but by h/w fault? e.g. timing to
allocate
virtual cpus to a guest machine when using virtual machine and so on.

>
>
> dunno... yes, this
> printk()->console_trylock()->do_raw_spin_lock()->spin_dump()-
> >printk()
> is possible, but it's possible only when your system is screwed up badly,
> so
> badly that this spin_dump() loop is not really a problem, IMHO.

IMHO, even though a system is screwed up badly, the spinlock debug code have
to print the information about the problem without lockup.

>
> if I'm missing something terribly obvious here, then please give more
> details.

There are already codes to prevent the recursive cycle in the bug case,
but not for the just suspected case. I just made it possible for the
latter case. That's all this patch is doing, now. :)

thanks,
byungchul

>
> -ss
>
> > > Signed-off-by: Byungchul Park <byungchul.park@xxxxxxx>
> > > ---
> > > include/linux/debug_locks.h | 4 ++++
> > > kernel/locking/spinlock_debug.c | 14 +++++++++++---
> > > 2 files changed, 15 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/debug_locks.h b/include/linux/debug_locks.h
> > > index 822c135..b0f977e 100644
> > > --- a/include/linux/debug_locks.h
> > > +++ b/include/linux/debug_locks.h
> > > @@ -10,6 +10,10 @@ struct task_struct;
> > > extern int debug_locks;
> > > extern int debug_locks_silent;
> > >
> > > +static inline void __debug_locks_on(void)
> > > +{
> > > + debug_locks = 1;
> > > +}
> > >
> > > static inline int __debug_locks_off(void)
> > > {
> > > diff --git a/kernel/locking/spinlock_debug.c
> b/kernel/locking/spinlock_debug.c
> > > index 0374a59..65177ba 100644
> > > --- a/kernel/locking/spinlock_debug.c
> > > +++ b/kernel/locking/spinlock_debug.c
> > > @@ -113,11 +113,19 @@ static void __spin_lock_debug(raw_spinlock_t
> *lock)
> > > return;
> > > __delay(1);
> > > }
> > > - /* lockup suspected: */
> > > - spin_dump(lock, "lockup suspected");
> > > +
> > > + /*
> > > + * We should prevent calling printk() further, since it would cause
> > > + * an infinite recursive cycle if it's called from printk()!
> > > + */
> > > + if (__debug_locks_off()) {
> > > + /* lockup suspected: */
> > > + spin_dump(lock, "lockup suspected");
> > > #ifdef CONFIG_SMP
> > > - trigger_all_cpu_backtrace();
> > > + trigger_all_cpu_backtrace();
> > > #endif
> > > + __debug_locks_on();
> > > + }
> > >
> > > /*
> > > * The trylock above was causing a livelock. Give the lower level
> arch
> > > --
> > > 1.9.1
> >