Re: [PATCH 2/7] workqueue: Share the same PWQ for the CPUs of a pod

From: kernel test robot
Date: Tue Jan 02 2024 - 21:56:00 EST




Hello,

kernel test robot noticed "WARNING:at_kernel/workqueue.c:#destroy_workqueue" on:

commit: 3f033de3cf87ef6c769b2d55ee1df715a982d650 ("[PATCH 2/7] workqueue: Share the same PWQ for the CPUs of a pod")
url: https://github.com/intel-lab-lkp/linux/commits/Lai-Jiangshan/workqueue-Reuse-the-default-PWQ-as-much-as-possible/20231227-225337
base: https://git.kernel.org/cgit/linux/kernel/git/tj/wq.git for-next
patch link: https://lore.kernel.org/all/20231227145143.2399-3-jiangshanlai@xxxxxxxxx/
patch subject: [PATCH 2/7] workqueue: Share the same PWQ for the CPUs of a pod

in testcase: hackbench
version: hackbench-x86_64-2.3-1_20220518
with the following parameters:

nr_threads: 800%
iterations: 4
mode: threads
ipc: pipe
cpufreq_governor: performance



compiler: gcc-12
test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory

(please refer to the attached dmesg/kmsg for the entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
| Closes: https://lore.kernel.org/oe-lkp/202401031025.95761451-oliver.sang@xxxxxxxxx


[ 30.471685][ T1] ------------[ cut here ]------------
[ 30.476998][ T1] WARNING: CPU: 111 PID: 1 at kernel/workqueue.c:4842 destroy_workqueue (kernel/workqueue.c:4842 (discriminator 1))
[ 30.486210][ T1] Modules linked in:
[ 30.489964][ T1] CPU: 111 PID: 1 Comm: swapper/0 Not tainted 6.6.0-15761-g3f033de3cf87 #1
[ 30.498396][ T1] Hardware name: Inspur NF8260M6/NF8260M6, BIOS 06.00.01 04/22/2022
[ 30.506220][ T1] RIP: 0010:destroy_workqueue (kernel/workqueue.c:4842 (discriminator 1))
[ 30.511794][ T1] Code: c2 75 f1 48 8b 43 08 48 39 98 a0 00 00 00 74 06 83 7b 18 01 7f 14 8b 43 5c 85 c0 75 0d 48 8b 53 68 48 8d 43 68 48 39 c2 74 4e <0f> 0b 48 c7 c6 e0 1d 42 82 48 8d 95 b0 00 00 00 48 c7 c7 68 a9 93
All code
========
0: c2 75 f1 retq $0xf175
3: 48 8b 43 08 mov 0x8(%rbx),%rax
7: 48 39 98 a0 00 00 00 cmp %rbx,0xa0(%rax)
e: 74 06 je 0x16
10: 83 7b 18 01 cmpl $0x1,0x18(%rbx)
14: 7f 14 jg 0x2a
16: 8b 43 5c mov 0x5c(%rbx),%eax
19: 85 c0 test %eax,%eax
1b: 75 0d jne 0x2a
1d: 48 8b 53 68 mov 0x68(%rbx),%rdx
21: 48 8d 43 68 lea 0x68(%rbx),%rax
25: 48 39 c2 cmp %rax,%rdx
28: 74 4e je 0x78
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 c7 c6 e0 1d 42 82 mov $0xffffffff82421de0,%rsi
33: 48 8d 95 b0 00 00 00 lea 0xb0(%rbp),%rdx
3a: 48 rex.W
3b: c7 .byte 0xc7
3c: c7 (bad)
3d: 68 .byte 0x68
3e: a9 .byte 0xa9
3f: 93 xchg %eax,%ebx

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 c7 c6 e0 1d 42 82 mov $0xffffffff82421de0,%rsi
9: 48 8d 95 b0 00 00 00 lea 0xb0(%rbp),%rdx
10: 48 rex.W
11: c7 .byte 0xc7
12: c7 (bad)
13: 68 .byte 0x68
14: a9 .byte 0xa9
15: 93 xchg %eax,%ebx
[ 30.531233][ T1] RSP: 0000:ffffc90000073dd8 EFLAGS: 00010002
[ 30.537151][ T1] RAX: ffff88a444cd1000 RBX: ffff88a444ce6600 RCX: 0000000000000000
[ 30.544968][ T1] RDX: ffff88a444ce665c RSI: 0000000000000286 RDI: ffff88a4444c4000
[ 30.552785][ T1] RBP: ffff88a444cd1000 R08: 0004afcaac775f46 R09: 0004afcaac775f46
[ 30.560605][ T1] R10: ffff88984f050840 R11: 0000000000008070 R12: ffff88a444cd1020
[ 30.568430][ T1] R13: ffffc90000073e00 R14: 0000000000000462 R15: 0000000000000000
[ 30.576246][ T1] FS: 0000000000000000(0000) GS:ffff88afcf8c0000(0000) knlGS:0000000000000000
[ 30.585017][ T1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 30.591447][ T1] CR2: 0000000000000000 CR3: 000000303e01c001 CR4: 00000000007706f0
[ 30.599266][ T1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 30.607085][ T1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 30.614910][ T1] PKRU: 55555554
[ 30.618314][ T1] Call Trace:
[ 30.621453][ T1] <TASK>
[ 30.624242][ T1] ? destroy_workqueue (kernel/workqueue.c:4842 (discriminator 1))
[ 30.629201][ T1] ? __warn (kernel/panic.c:677)
[ 30.633129][ T1] ? destroy_workqueue (kernel/workqueue.c:4842 (discriminator 1))
[ 30.638091][ T1] ? report_bug (lib/bug.c:180 lib/bug.c:219)
[ 30.642454][ T1] ? handle_bug (arch/x86/kernel/traps.c:237)
[ 30.646639][ T1] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1))
[ 30.651171][ T1] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
[ 30.656049][ T1] ? destroy_workqueue (kernel/workqueue.c:4842 (discriminator 1))
[ 30.661009][ T1] ? destroy_workqueue (kernel/workqueue.c:4783 kernel/workqueue.c:4842)
[ 30.665888][ T1] ? __pfx_ftrace_check_sync (kernel/trace/ftrace.c:3803)
[ 30.671200][ T1] ftrace_check_sync (kernel/trace/ftrace.c:3808)
[ 30.675820][ T1] do_one_initcall (init/main.c:1236)
[ 30.680354][ T1] do_initcalls (init/main.c:1297 init/main.c:1314)
[ 30.684625][ T1] kernel_init_freeable (init/main.c:1555)
[ 30.689678][ T1] ? __pfx_kernel_init (init/main.c:1433)
[ 30.694471][ T1] kernel_init (init/main.c:1443)
[ 30.698658][ T1] ret_from_fork (arch/x86/kernel/process.c:147)
[ 30.702927][ T1] ? __pfx_kernel_init (init/main.c:1433)
[ 30.707713][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
[ 30.712333][ T1] </TASK>
[ 30.715217][ T1] ---[ end trace 0000000000000000 ]---
[ 30.720522][ T1] destroy_workqueue: ftrace_check_wq has the following busy pwq
[ 30.728002][ T1] pwq 452: cpus=0-223 node=3 flags=0x4 nice=0 active=0/256 refcnt=56
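
For readers decoding the trace, the instructions just before the trapping ud2 (a compare against a pointer at 0xa0(%rax), a "cmpl $0x1,0x18(%rbx)" followed by jg, a test of 0x5c(%rbx), and a list-head self-compare at 0x68(%rbx)) look like the busy check that destroy_workqueue() runs on each pwq. Below is a minimal userspace sketch of that check, assuming it resembles pwq_busy() in kernel/workqueue.c. All model_* names, the struct layouts, and the color count are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
#define MODEL_NR_COLORS 16	/* assumed value, for illustration only */

struct model_wq;

struct model_pwq {
	struct model_wq *wq;	/* owning workqueue */
	int refcnt;		/* reference count, as printed in the log */
	int nr_active;		/* "active=0/256" reports 0 here */
	int nr_inactive;	/* stands in for a non-empty inactive list */
	int nr_in_flight[MODEL_NR_COLORS];
};

struct model_wq {
	struct model_pwq *dfl_pwq;	/* default pwq of the workqueue */
};

/* Returns true if the pwq would be reported as busy at destroy time. */
static bool model_pwq_busy(const struct model_pwq *pwq)
{
	for (size_t i = 0; i < MODEL_NR_COLORS; i++)
		if (pwq->nr_in_flight[i])
			return true;

	/* A non-default pwq is expected to hold only its base reference. */
	if (pwq != pwq->wq->dfl_pwq && pwq->refcnt > 1)
		return true;

	if (pwq->nr_active || pwq->nr_inactive)
		return true;

	return false;
}

/* Models the reported "pwq 452": active=0, refcnt=56, assumed non-default. */
static bool reported_pwq_is_busy(void)
{
	struct model_pwq dfl = { .refcnt = 1 };
	struct model_wq wq = { .dfl_pwq = &dfl };
	struct model_pwq shared = { .wq = &wq, .refcnt = 56, .nr_active = 0 };

	dfl.wq = &wq;
	return model_pwq_busy(&shared);
}
```

Under these assumptions, a pwq with no active work is still flagged when it is not the default pwq and its refcnt exceeds 1, which would be consistent with the "active=0/256 refcnt=56" line above. The value 56 also equals 224 threads / 4 sockets on the test machine, though the log alone does not confirm that each CPU of a pod holds one reference.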


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240103/202401031025.95761451-oliver.sang@xxxxxxxxx



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki