Re: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in the debug code

From: Byungchul Park
Date: Thu Jan 28 2016 - 22:01:07 EST

On Fri, Jan 29, 2016 at 09:54:06AM +0900, Sergey Senozhatsky wrote:
> because you don't give any details and don't answer any questions.

There are 2 ways to make the kernel better and stabler.

1) Remove the possiblity which make the system go crazy, even though it
would hardly happen since the possiblity is too low.

2) Fix it after facing some problems in practice and debugging it.

I started to write this patch due to the 2nd reason after seeing the
backtrace in gdb. But I lost the data with which I can debug it now,
since I was mis-convinced that it was done. So I could not answer it for
the your questions about memory corruption and cpu off. Sorry for not
informing you these facts in advance. But please remind that I was in
progress by the 1st way.

> it took a while to even find out that you are reporting this issues
> not against a real H/W, but a qemu. I suppose qemu-arm running on
> x86_64 box.

No matter what kind of box I used because I only kept talking about the
possiblity. It does not depend on a box at all.

> now, what else we don't know?
> explain STEP-BY-STEP why do you think spinlock debug code can lockup
> itself. not just "I don't think this is the case, I don't think that
> is the case".

I did explaining the reason in detail even though there's something I
missed. I've never said "I don't think this is the case" on the
description explaining the problem. Anyway, I am not sure about my patch
now, thank to your advice.

> on very spin_dump recursive call it waits for the spin_lock and when
> it eventually grabs it, it does the job that it wanted to do under
> that spin lock, unlock it and return back. and the only case when it
> never "return back" is when it never "eventually grabs it".

Right. I missed it.

> so I still don't see what issue you fix here -- the possibility to
> consume the entire kernel stack doing recursive spin_dump->spin_lock()
> calls because:
> a) something never unlocks the lock (no matter why.. corruption, HW
> fault, etc.)
> or
> b) everything was OK, but we attempted to printk() already
> being in a very-very deep callstack, so doing 5 extra
> printk->spin_dump->printk->spin_dump would simply kill it.
> if none of the above. then what you report and fix is simply non
> realistic. spin_dump must eventually unwind the stack back. yes,
> you'll see a lot of dump_stack() and all cpus backtraces done on
> every roollback stack. but you would still see some of them anyway,
> even w/o the spinlock debug code -- because you'd just
> raw_spin_lock_irqsave() on that lock for a very long time; which
> does upset watchdog, etc.

I am not sure now, if it can be fixed by the 1st way, that is, removing
the possiblity which make the system go crazy. There's something I missed.
Now I have to solve this problem by the 2nd way after reproducing it and
debugging it in detail. I still keep trying to reproduce it now.

Anyway. Thank you very much.


> please start explaining the things.
> -ss