Re: [PATCH v4] lib/spinlock_debug.c: prevent a recursive cycle in the debug code

From: Peter Hurley
Date: Thu Jan 28 2016 - 23:32:25 EST


On 01/28/2016 04:27 PM, Sergey Senozhatsky wrote:
> On (01/28/16 15:08), Peter Hurley wrote:
> [..]
>>> even if at some level of recursion (nested printk calls)
>>> spin_dump()->__spin_lock_debug()->arch_spin_trylock() acquires the
>>> lock, it returns back with the spin lock unlocked anyway.
>>>
>>> vprintk_emit()
>>> console_trylock()
>>> spin_lock()
>>> spin_dump()
>>> vprintk_emit()
>>> console_trylock()
>>> spin_lock()
>>> spin_dump()
>>> vprintk_emit()
>>> console_trylock()
>>> spin_lock() << OK, got the lock finally
>>
>> The problem is you have postulated a very shallow recursion.
>> This looks much worse if this happens 1000 times, and
>> probably won't recover to output anything.
>
> well, the stack is surely limited, but on every
> spin_dump()->spin_lock() recursive call it does another
> round of
>
> u64 loops = loops_per_jiffy * HZ;
>
> for (i = 0; i < loops; i++) {
> if (arch_spin_trylock(&lock->raw_lock))
> return;
> __delay(1);
> }
>
> so if you have 1000 spin_dump()->spin_lock() then, well,
> something has been holding the lock for '1000 * loops_per_jiffy * HZ'.
>
> and in particularly this case that somethign was holding the
> spin lock doing trivial operations like
>
> count = sem->count - 1;
> if (likely(count >= 0))
> sem->count = count;
>
> (or a bit more if it was in down()). but still.
>
> and it's kinda hard to imagine console_sem lock being soooooooo
> congested and unfair. on each given point of time in the worst
> case there are `num_online_cpus() - 1' cpus spinning on that spin_lock
> and 1 cpu holding that spinlock. which in Byungchul's case is, what,
> 3 spinning cpus, or 7 spinnign cpus?...


You're assuming that Byungchul's patch is relevant to the recursion
he witnessed. There are several paths into spin_dump().

Here's one that doesn't wait at all:

vprintk_emit()
console_trylock()
down_trylock()
raw_spin_lock_irqsave()
...
do_raw_spin_lock()
debug_spin_lock_before()
SPIN_BUG_ON()
spin_bug()
spin_dump()
printk()
** RINSE AND REPEAT **



>> Additionally, what if the console_sem is simply corrupted?
>> A livelock with no output ever is not very helpful.
>
> if it's corrupted then this is not a spinlock debug problem.
> at all.

I don't follow you.

The whole point of SPIN_BUG_ON() is to catch problems that should never
happen, but nevertheless, have.

IOW, following your logic, we should remove the SPIN_BUG_ON() because
these situations should not happen.


>> As I wrote earlier, I don't think this is the way to fix
>> recursion problems with printk() [by eliding output].
>>
>> Rather, a way to effectively determine a recursion is in progress,
>> and _at a minimum_ guaranteeing that the recursive output will
>> eventually be output should be the goal.
>>
>> Including dumb recursion like a console driver printing
>> an error :/

This was in reference to a problem with spin lock recursion that
can't print. The spin lock recursion deadlocks, but you'll never
see the diagnostic because the driver is already holding the lock
(not from printk() but from some other code).

The printk doesn't even need to be directly related to the
console driver itself, but some other thing that the console driver
depends on while holding the spin lock that it claims for console output.

Deadlock, no output.

For example,

serial8250_do_set_termios()
spin_lock_irqsave() ** claim port lock **
...
serial_port_out(port, UART_LCR, ....);
dw8250_serial_out()
dev_err()
vprintk_emit()
console_trylock()
call_console_drivers()
serial8250_console_write()
spin_lock_irqsave() ** port lock **
** DEADLOCK **


> this is not a case of printk recursion and it should be handled
> just fine. console drivers are called under console_sem only.
> logbuf lock is unlocked. vprintk_emit() adds message to the logbuf,
> calls console_trylock() (which of course does not lock anything)
> and returns back to console_driver code.

Not if locks are zapped because printk() recognizes a recursion.
Note console driver's locks are not zapped. For example,

vprintk_emit()
...
call_console_drivers()
/* inside some console driver */
claim some lock
printk("%s\n", NULL); /* you get the idea */
vprintk_emit()
logbuf_lock
vscnprintf()
** oops **
vprintk_emit()
recursion detected
zap_locks()
....
call_console_drivers()
/* inside same console driver */
claim same lock
** DEADLOCK **



> the only case when we really have a printk recursion is when
> someone calls printk() from within the vprintk_emit() logbuf_lock
> area.

Not true.

A while back, Jan Kara reworked the call site around
console_trylock_from_printk(). Hung with no output under unknown
conditions [1].

Never solved, but obviously means there are unhandled recursions.

So lockdep remains disabled when calling console drivers.

Regards,
Peter Hurley

[1] https://lkml.org/lkml/2014/6/11/591


> print()
> spin_lock logbuf
> printk()
> spin_lock logbuf <<< recursion
> spin_unlock logbuf
>
>
> -ss
>
>> Then, lockdep could remain enabled while calling console drivers.
>>
>> Regards,
>> Peter Hurley
>>
>>> sem->count--
>>> spin_unlock() << unlock, return
>>> arch_spin_lock() << got the lock, return
>>> sem->count--
>>> spin_unlock() << unlock, return
>>> arch_spin_lock() << got the lock, return
>>> sem->count--
>>> spin_unlock() << unlock, return
>>>
>>>
>>> ...um
>>>
>>>
>>>> But I found there's a possiblity in the debug code *itself* to cause a
>>>> lockup.
>>>
>>> please explain.
>>>
>>> -ss
>>>
>>