lockdep splat in v5.11-rc1 involving console_sem and rq locks

From: Paul E. McKenney
Date: Tue Jan 05 2021 - 17:01:58 EST

Next message: Dexuan Cui: "RE: [PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow"
Previous message: harshad shirwadkar: "Re: [PATCH] ext4: Change list_for_each to list_for_each_entry"
Next in thread: Peter Zijlstra: "Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello!

The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
results in occasional lockdep splats on v5.11-rc1 on x86. This failure
is probabalistic, sometimes happening as much as 30% of the time, but
sometimes happening quite a bit less frequently. (And yes, this did
result in a false bisection. Why do you ask?) The problem seems to
happen more frequently shortly after boot, so for fastest reproduction
run lots of 10-minute RUDE01 runs, which did eventually result in a
good bisection. (Yes, I did hammer the last good commit for awhile.)

The first bad commit is 1cf12e08bc4d ("sched/hotplug: Consolidate task
migration on CPU unplug"). An example splat is shown below.

Thoughts?

Thanx, Paul

======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc1+ #835 Not tainted
------------------------------------------------------
migration/1/14 is trying to acquire lock:
ffffffffb014ed38 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0xa/0x30

but task is already holding lock:
ffff92a1df52b158 (&rq->lock){-.-.}-{2:2}, at: sched_cpu_dying+0x2b/0xe0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&rq->lock){-.-.}-{2:2}:
lock_acquire+0xc7/0x3a0
_raw_spin_lock+0x2a/0x40
task_fork_fair+0x39/0x150
sched_fork+0x109/0x210
copy_process+0x9b7/0x1da0
kernel_clone+0x98/0x6d0
kernel_thread+0x50/0x70
rest_init+0x1d/0x236
start_kernel+0x535/0x55a
secondary_startup_64_no_verify+0xb8/0xbb

-> #1 (&p->pi_lock){-.-.}-{2:2}:
lock_acquire+0xc7/0x3a0
_raw_spin_lock_irqsave+0x33/0x50
try_to_wake_up+0x5a/0x760
up+0x3b/0x50
__up_console_sem+0x29/0x60
console_unlock+0x31e/0x550
vprintk_emit+0x1c5/0x2d0
printk+0x53/0x6a
rcu_torture_read_exit.cold.36+0x5b/0x7f
kthread+0x13b/0x160
ret_from_fork+0x22/0x30

-> #0 ((console_sem).lock){-.-.}-{2:2}:
validate_chain+0x8b4/0x1c20
__lock_acquire+0x576/0xaf0
lock_acquire+0xc7/0x3a0
_raw_spin_lock_irqsave+0x33/0x50
down_trylock+0xa/0x30
__down_trylock_console_sem+0x23/0x90
console_trylock+0xe/0x50
vprintk_emit+0xcb/0x2d0
printk+0x53/0x6a
report_bug.cold.4+0xc/0x4c
handle_bug+0x44/0x80
exc_invalid_op+0x13/0x60
asm_exc_invalid_op+0x12/0x20
sched_cpu_dying+0xd2/0xe0
cpuhp_invoke_callback+0x9e/0x890
take_cpu_down+0x62/0xa0
multi_cpu_stop+0x5e/0x100
cpu_stopper_thread+0x8a/0x130
smpboot_thread_fn+0x193/0x230
kthread+0x13b/0x160
ret_from_fork+0x22/0x30

other info that might help us debug this:

Chain exists of:
(console_sem).lock --> &p->pi_lock --> &rq->lock

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(&rq->lock);
lock(&p->pi_lock);
lock(&rq->lock);
lock((console_sem).lock);

*** DEADLOCK ***

1 lock held by migration/1/14:
#0: ffff92a1df52b158 (&rq->lock){-.-.}-{2:2}, at: sched_cpu_dying+0x2b/0xe0

stack backtrace:
CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.10.0-rc1+ #835
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-2.el7 04/01/2014
Stopper: multi_cpu_stop+0x0/0x100 <- 0x0
Call Trace:
dump_stack+0x77/0x97
check_noncircular+0xfe/0x110
validate_chain+0x8b4/0x1c20
__lock_acquire+0x576/0xaf0
lock_acquire+0xc7/0x3a0
? down_trylock+0xa/0x30
? vprintk_emit+0xcb/0x2d0
_raw_spin_lock_irqsave+0x33/0x50
? down_trylock+0xa/0x30
down_trylock+0xa/0x30
__down_trylock_console_sem+0x23/0x90
console_trylock+0xe/0x50
vprintk_emit+0xcb/0x2d0
printk+0x53/0x6a
? sched_cpu_dying+0xd2/0xe0
report_bug.cold.4+0xc/0x4c
handle_bug+0x44/0x80
exc_invalid_op+0x13/0x60
asm_exc_invalid_op+0x12/0x20
RIP: 0010:sched_cpu_dying+0xd2/0xe0
Code: c0 41 5c 41 5d 41 5e 41 5f 5d c3 f0 48 01 05 b5 ff 00 02 eb c9 48 89 df e8 db 10 06 00 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 53 48 89 fb e8 a7 72 02
RSP: 0000:ffff9c0740087db0 EFLAGS: 00010002
RAX: ffff92a1c1063000 RBX: ffff92a1df52b140 RCX: 00000000000b71b0
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff92a1df52b140
RBP: ffff9c0740087dd8 R08: ffff92a1c105d094 R09: ffff92a1c118db00
R10: ffff9c0740087bc8 R11: ffff92a1c118e400 R12: 0000000000000001
R13: ffff92a1df52b158 R14: 0000000000000082 R15: 000000000000b319
? sched_cpu_wait_empty+0x80/0x80
cpuhp_invoke_callback+0x9e/0x890
? hpet_assign_irq+0x90/0x90
take_cpu_down+0x62/0xa0
multi_cpu_stop+0x5e/0x100
? stop_machine_yield+0x10/0x10
cpu_stopper_thread+0x8a/0x130
? sort_range+0x20/0x20
smpboot_thread_fn+0x193/0x230
kthread+0x13b/0x160
? kthread_insert_work_sanity_check+0x50/0x50
ret_from_fork+0x22/0x30

Next message: Dexuan Cui: "RE: [PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow"
Previous message: harshad shirwadkar: "Re: [PATCH] ext4: Change list_for_each to list_for_each_entry"
Next in thread: Peter Zijlstra: "Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]