Re: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in the debug code

From: Sergey Senozhatsky
Date: Thu Jan 28 2016 - 19:53:45 EST


On (01/29/16 08:54), Byungchul Park wrote:
> > The problem is you have postulated a very shallow recursion.
> > This looks much worse if this happens 1000 times, and
> > probably won't recover to output anything.
> >
> > Additionally, what if the console_sem is simply corrupted?
> > A livelock with no output ever is not very helpful.
> >
> > As I wrote earlier, I don't think this is the way to fix
> > recursion problems with printk() [by eliding output].
>
> I think you are currently misunderstading about this patch. Or I'm
> misunderstanding you.. The patch was changed in v4 so that it can print
> a debug message even in the recursive cycle case, at the first time.
>
> I also thought printing nothing in the case was not helpful at all which I
> did in v1,2,3. But it's changed in v4, that is, this patch.

because you don't give any details and don't answer any questions.
it took a while to even find out that you are reporting this issues
not against a real H/W, but a qemu. I suppose qemu-arm running on
x86_64 box.

now, what else we don't know?

explain STEP-BY-STEP why do you think spinlock debug code can lockup
itself. not just "I don't think this is the case, I don't think that
is the case".

on very spin_dump recursive call it waits for the spin_lock and when
it eventually grabs it, it does the job that it wanted to do under
that spin lock, unlock it and return back. and the only case when it
never "return back" is when it never "eventually grabs it".

so I still don't see what issue you fix here -- the possibility to
consume the entire kernel stack doing recursive spin_dump->spin_lock()
calls because:
a) something never unlocks the lock (no matter why.. corruption, HW
fault, etc.)
or
b) everything was OK, but we attempted to printk() already
being in a very-very deep callstack, so doing 5 extra
printk->spin_dump->printk->spin_dump would simply kill it.


if none of the above. then what you report and fix is simply non
realistic. spin_dump must eventually unwind the stack back. yes,
you'll see a lot of dump_stack() and all cpus backtraces done on
every roollback stack. but you would still see some of them anyway,
even w/o the spinlock debug code -- because you'd just
raw_spin_lock_irqsave() on that lock for a very long time; which
does upset watchdog, etc.


please start explaining the things.

-ss