Traceback in ww_mutex test (test_cycle_work) on arm64/x86_64

From: Guenter Roeck
Date: Sun Sep 23 2018 - 15:57:14 EST


Hi,

when enabling CONFIG_WW_MUTEX_SELFTEST on arm64 or x86_64,
I get the following traceback.

[ 3.111852] ------------[ cut here ]------------
[ 3.112100] DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current)
[ 3.112753] WARNING: CPU: 1 PID: 771 at kernel/locking/mutex.c:1211 __mutex_unlock_slowpath+0x1a8/0x2e0
[ 3.113238] Modules linked in:
[ 3.113774] CPU: 1 PID: 771 Comm: kworker/u16:8 Not tainted 4.19.0-rc5-dirty #1
[ 3.114025] Hardware name: linux,dummy-virt (DT)
[ 3.114587] Workqueue: test-ww_mutex test_cycle_work
[ 3.114950] pstate: 40000005 (nZcv daif -PAN -UAO)
[ 3.115144] pc : __mutex_unlock_slowpath+0x1a8/0x2e0
[ 3.115327] lr : __mutex_unlock_slowpath+0x1a8/0x2e0
[ 3.115500] sp : ffff00000b7cbc40
[ 3.115647] x29: ffff00000b7cbc40 x28: 0000000000000000
[ 3.115921] x27: ffff00000942f000 x26: ffff00000a204da0
[ 3.116155] x25: ffff00000a1c93d0 x24: ffff000009103cd8
[ 3.116376] x23: ffff00000a1c9000 x22: ffff00000942f000
[ 3.116596] x21: ffff00000b7cbca8 x20: ffff80001c05f8c8
[ 3.116817] x19: 0000000000000000 x18: ffffffffffffffff
[ 3.117036] x17: 0000000000000000 x16: 0000000000000000
[ 3.117256] x15: ffff00000942f808 x14: ffff00008a1c8bb7
[ 3.117476] x13: ffff00000a1c8bc5 x12: ffff00000944f000
[ 3.117695] x11: 0000000005f5e0ff x10: ffff0000094b3000
[ 3.117947] x9 : 0000000000000000 x8 : ffff00000942f808
[ 3.118172] x7 : ffff00000816153c x6 : 0000000000000000
[ 3.118392] x5 : 0000000000000000 x4 : ffff00000b7cc000
[ 3.118612] x3 : 6172e063a21fe200 x2 : ffff00000944fd80
[ 3.118830] x1 : 6172e063a21fe200 x0 : 0000000000000000
[ 3.119169] Call trace:
[ 3.119348] __mutex_unlock_slowpath+0x1a8/0x2e0
[ 3.119540] ww_mutex_unlock+0x48/0xa0
[ 3.119709] test_cycle_work+0x10c/0x220
[ 3.119864] process_one_work+0x29c/0x708
[ 3.120016] worker_thread+0x40/0x458
[ 3.120179] kthread+0x12c/0x130
[ 3.120317] ret_from_fork+0x10/0x18

Debugging shows that the traceback occurs in the following code
in test_cycle_work().

+ err = ww_mutex_lock(cycle->b_mutex, &ctx);
+ if (err == -EDEADLK) {
# true
+ ww_mutex_unlock(&cycle->a_mutex);
+ ww_mutex_lock_slow(cycle->b_mutex, &ctx);
+ err = ww_mutex_lock(&cycle->a_mutex, &ctx);
# returns with err == -EDEADLK
+ }
+
+ if (!err)
+ ww_mutex_unlock(cycle->b_mutex);
+ ww_mutex_unlock(&cycle->a_mutex);
# traceback seen here:
# unlocks a_mutex even though it was not
# acquired by this thread

The problem is quite easy to reproduce with the following qemu command.

qemu-system-aarch64 -M virt -cpu cortex-a57 -nographic -monitor none \
-kernel arch/arm64/boot/Image -no-reboot -smp 8 -m 512
-device virtio-blk-pci,drive=d0 \
-drive file=rootfs.ext2,if=none,id=d0,format=raw \
-append 'console=ttyAMA0 root=/dev/vda rw'

or:

qemu-system-x86_64 \
-kernel arch/x86/boot/bzImage \
-M q35 \
-cpu Skylake-Server \
-no-reboot -smp 8 -m 1G \
-usb -device usb-storage,drive=d0 \
-drive file=rootfs.ext2,if=none,id=d0,format=raw \
--append 'root=/dev/sda rw rootwait console=ttyS0 console=tty' \
-nographic

Details don't really matter as long as the number of CPUs is at least 8
(I have not seen the problem with 1, 2, 4, or 6 CPUs). My test system
has 8 CPU cores (times 2 for hyperthreading), so that may be related.

The test case above is clearly wrong if both calls to ww_mutex_lock()
fail with -EDEADLK. Unfortunately I don't know the expected behavior
in this case, so I'll have to pass this on without a proposed fix.

Please let me know if there is anything I can do to help fixing
the problem.

Thanks,
Guenter