Re: [PATCH (repost)] locking/lockdep: add debug_show_all_lock_holders()

From: Waiman Long
Date: Fri Sep 16 2022 - 10:51:35 EST


On 9/16/22 10:15, Tetsuo Handa wrote:
Currently, check_hung_uninterruptible_tasks() reports details of locks
held in the system. Also, lockdep_print_held_locks() does not report
details of locks held by a thread if that thread is in TASK_RUNNING state.
Several years of experience of debugging without vmcore tells me that
these limitations have been a barrier for understanding what went wrong
in syzbot's "INFO: task hung in" reports.

I initially thought that the cause of "INFO: task hung in" reports is
due to over-stressing. But I understood that over-stressing is unlikely.
I now consider that there likely is a deadlock/livelock bug where lockdep
cannot report as a deadlock when "INFO: task hung in" is reported.

A typical case is that thread-1 is waiting for something to happen (e.g.
wait_event_*()) with a lock held. When thread-2 tries to hold that lock
using e.g. mutex_lock(), check_hung_uninterruptible_tasks() reports that
thread-2 is hung and thread-1 is holding a lock which thread-2 is trying
to hold. But currently check_hung_uninterruptible_tasks() cannot report
the exact location of thread-1 which gives us an important hint for
understanding why thread-1 is holding that lock for so long period.

When check_hung_uninterruptible_tasks() reports a thread waiting for a
lock, it is important to report backtrace of threads which already held
that lock. Therefore, allow check_hung_uninterruptible_tasks() to report
the exact location of threads which is holding any lock.

I am not against this patch, but I do like to see you wrapping your code in a __debug_show_all_locks() wrapper, for instance, with flags and make debug_show_all_locks() uses the new wrapper to avoid code redundancy.


Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
---
This is repost of https://lkml.kernel.org/r/82af40cc-bf85-2b53-b8f9-dfc12e66a781@xxxxxxxxxxxxxxxxxxx .
I think there was no critical objection which blocks this change.

I wish that lockdep continues tracking locks (i.e. debug_locks remains 1)
even after something went wrong, for recently I sometimes encounter problems
that disable lockdep during boot stage.

It would be noisy to report possibility of e.g. circular locking dependency
every time due to keeping debug_locks enabled. But tracking locks even after
something went wrong will help debug_show_all_lock_holders() to survive
problems during boot stage.

I'm not expecting lockdep to report the same problem forever.
Reporting possibility of each problem pattern (e.g. circular locking dependency)
up to once, by using cmpxchg() inside reporting functions that call printk(),
would be enough.

I'm expecting lockdep to continue working without calling printk() even after
one of problem patterns (e.g. circular locking dependency) was printk()ed, so that
debug_show_all_locks()/debug_show_all_lock_holders() can call printk() when needed.

Changing debug_locks behavior is a future patch. For now, this patch alone
will help debugging Greg's usb.git#usb-testing tree which is generating
many "INFO: task hung in" reports.

Boqun is actually working on a modularization patch to make some lockdep checking still active even after a lockdep bug is reported. I think he will take into consideration about this request.

Cheers,
Longman