On Wed, Feb 12, 2025 at 09:10:25AM -0500, Waiman Long wrote:
On 2/12/25 12:45 AM, Boqun Feng wrote:Hmm... but your patch does remove the dependency console_sem.lock ->
On Tue, Feb 11, 2025 at 09:18:56PM -0500, Waiman Long wrote:After some more thought, I realize that this patch doesn't really fix the
On 1/26/25 8:31 PM, Waiman Long wrote:FWIW, I already queued in my lockdep-for-tip branch, will send it in a
A circular lock dependency splat has been seen involving down_trylock().I can also add the following as another instance where deadlock can happen.
[ 4011.795602] ======================================================
[ 4011.795603] WARNING: possible circular locking dependency detected
[ 4011.795607] 6.12.0-41.el10.s390x+debug
[ 4011.795612] ------------------------------------------------------
[ 4011.795613] dd/32479 is trying to acquire lock:
[ 4011.795617] 0015a20accd0d4f8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x26/0x90
[ 4011.795636]
[ 4011.795636] but task is already holding lock:
[ 4011.795637] 000000017e461698 (&zone->lock){-.-.}-{2:2}, at: rmqueue_bulk+0xac/0x8f0
the existing dependency chain (in reverse order) is:
-> #4 (&zone->lock){-.-.}-{2:2}:
-> #3 (hrtimer_bases.lock){-.-.}-{2:2}:
-> #2 (&rq->__lock){-.-.}-{2:2}:
-> #1 (&p->pi_lock){-.-.}-{2:2}:
-> #0 ((console_sem).lock){-.-.}-{2:2}:
The console_sem -> pi_lock dependency is due to calling try_to_wake_up()
while holding the console.sem raw_spinlock. This dependency can be broken
by using wake_q to do the wakeup instead of calling try_to_wake_up()
under the console_sem lock. This will also make the semaphore's
raw_spinlock become a terminal lock without taking any further locks
underneath it.
The hrtimer_bases.lock is a raw_spinlock while zone->lock is a
spinlock. The hrtimer_bases.lock -> zone->lock dependency happens via
the debug_objects_fill_pool() helper function in the debugobjects code.
[ 4011.795646] -> #4 (&zone->lock){-.-.}-{2:2}:
[ 4011.795650] __lock_acquire+0xe86/0x1cc0
[ 4011.795655] lock_acquire.part.0+0x258/0x630
[ 4011.795657] lock_acquire+0xb8/0xe0
[ 4011.795659] _raw_spin_lock_irqsave+0xb4/0x120
[ 4011.795663] rmqueue_bulk+0xac/0x8f0
[ 4011.795665] __rmqueue_pcplist+0x580/0x830
[ 4011.795667] rmqueue_pcplist+0xfc/0x470
[ 4011.795669] rmqueue.isra.0+0xdec/0x11b0
[ 4011.795671] get_page_from_freelist+0x2ee/0xeb0
[ 4011.795673] __alloc_pages_noprof+0x2c2/0x520
[ 4011.795676] alloc_pages_mpol_noprof+0x1fc/0x4d0
[ 4011.795681] alloc_pages_noprof+0x8c/0xe0
[ 4011.795684] allocate_slab+0x320/0x460
[ 4011.795686] ___slab_alloc+0xa58/0x12b0
[ 4011.795688] __slab_alloc.isra.0+0x42/0x60
[ 4011.795690] kmem_cache_alloc_noprof+0x304/0x350
[ 4011.795692] fill_pool+0xf6/0x450
[ 4011.795697] debug_object_activate+0xfe/0x360
[ 4011.795700] enqueue_hrtimer+0x34/0x190
[ 4011.795703] __run_hrtimer+0x3c8/0x4c0
[ 4011.795705] __hrtimer_run_queues+0x1b2/0x260
[ 4011.795707] hrtimer_interrupt+0x316/0x760
[ 4011.795709] do_IRQ+0x9a/0xe0
[ 4011.795712] do_irq_async+0xf6/0x160
Normally raw_spinlock to spinlock dependency is not legit
and will be warned if PROVE_RAW_LOCK_NESTING is enabled,
but debug_objects_fill_pool() is an exception as it explicitly
allows this dependency for non-PREEMPT_RT kernel without causing
PROVE_RAW_LOCK_NESTING lockdep splat. As a result, this dependency is
legit and not a bug.
Anyway, semaphore is the only locking primitive left that is still
using try_to_wake_up() to do wakeup inside critical section, all the
other locking primitives had been migrated to use wake_q to do wakeup
outside of the critical section. It is also possible that there are
other circular locking dependencies involving printk/console_sem or
other existing/new semaphores lurking somewhere which may show up in
the future. Let just do the migration now to wake_q to avoid headache
like this.
Reported-by:syzbot+ed801a886dfdbfe7136d@xxxxxxxxxxxxxxxxxxxxxxxxx
PR to Peter in one or two weeks (in case he hasn't taken it before
then).
BTW, do we need a "Fixes" tag for stable kernels?
circular lock dependency problem, it just remove console_sem.lock from it.
The problem is that printk() can be called in any context. To really solve
the problem, we will need some kind of deferred wakeup using workqueue, for
instance. As printing to the console is inherently slow, adding some more
pi_lock, right? So it does fix the circular lock dependency problem. Or
do you mean that it doesn't fix all the deadlocks that may cause by
printk()?