Re: selftests: user_events: ftrace_test - RIP: 0010:tracing_update_buffers (kernel/trace/trace.c:6470)

From: Beau Belgrave
Date: Fri Oct 27 2023 - 15:20:25 EST


On Fri, Oct 27, 2023 at 05:38:41PM +0530, Naresh Kamboju wrote:
> Following kernel crash noticed on x86_64 while running selftests: user_events:
> ftrace_test running 6.6.0-rc7-next-20231026.
>
> Reported-by: Linux Kernel Functional Testing <lkft@xxxxxxxxxx>
>
> kselftest: Running tests in user_events
> TAP version 13
> 1..4
> # timeout set to 90
> # selftests: user_events: ftrace_test
> [ 2391.606817] general protection fault, probably for non-canonical
> address 0x6b6b6b6b6b6b8a83: 0000 [#1] PREEMPT SMP PTI
> [ 2391.617519] CPU: 1 PID: 34662 Comm: ftrace_test Not tainted
> 6.6.0-rc7-next-20231026 #1
> [ 2391.625428] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.7 12/07/2021
> [ 2391.632811] RIP: 0010:tracing_update_buffers (kernel/trace/trace.c:6470)
> [ 2391.637952] Code: 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00
> 55 31 f6 48 89 e5 41 55 41 54 53 48 89 fb 48 c7 c7 40 8c 61 94 e8 92
> d3 5a 01 <44> 0f b6 a3 18 1f 00 00 41 80 fc 01 0f 87 c8 dc 4e 01 45 31
> ed 41
> All code
> ========
> 0: 90 nop
> 1: 90 nop
> 2: 90 nop
> 3: 90 nop
> 4: 90 nop
> 5: 90 nop
> 6: 90 nop
> 7: 90 nop
> 8: 90 nop
> 9: 90 nop
> a: 90 nop
> b: 90 nop
> c: 66 0f 1f 00 nopw (%rax)
> 10: 55 push %rbp
> 11: 31 f6 xor %esi,%esi
> 13: 48 89 e5 mov %rsp,%rbp
> 16: 41 55 push %r13
> 18: 41 54 push %r12
> 1a: 53 push %rbx
> 1b: 48 89 fb mov %rdi,%rbx
> 1e: 48 c7 c7 40 8c 61 94 mov $0xffffffff94618c40,%rdi
> 25: e8 92 d3 5a 01 callq 0x15ad3bc
> 2a:* 44 0f b6 a3 18 1f 00 movzbl 0x1f18(%rbx),%r12d <-- trapping instruction
> 31: 00
> 32: 41 80 fc 01 cmp $0x1,%r12b
> 36: 0f 87 c8 dc 4e 01 ja 0x14edd04
> 3c: 45 31 ed xor %r13d,%r13d
> 3f: 41 rex.B
>
> Code starting with the faulting instruction
> ===========================================
> 0: 44 0f b6 a3 18 1f 00 movzbl 0x1f18(%rbx),%r12d
> 7: 00
> 8: 41 80 fc 01 cmp $0x1,%r12b
> c: 0f 87 c8 dc 4e 01 ja 0x14edcda
> 12: 45 31 ed xor %r13d,%r13d
> 15: 41 rex.B
> [ 2391.656696] RSP: 0018:ffffb36e0a477d80 EFLAGS: 00010246
> [ 2391.661937] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000080000000
> [ 2391.669064] RDX: 0000000000000000 RSI: ffffffff9299b722 RDI: ffffffff9299b722
> [ 2391.676195] RBP: ffffb36e0a477d98 R08: 000000000000002f R09: 0000000000000002
> [ 2391.683321] R10: ffffb36e0a477d70 R11: 0000000000000000 R12: 0000000000000002
> [ 2391.690453] R13: ffffb36e0a477e88 R14: ffff99c5803a2230 R15: ffff99c581c39000
> [ 2391.697586] FS: 00007fb4b9681740(0000) GS:ffff99c6efa80000(0000)
> knlGS:0000000000000000
> [ 2391.705670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2391.711410] CR2: 00007fb4b96ab5e0 CR3: 000000010635c002 CR4: 00000000003706f0
> [ 2391.718540] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2391.725665] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2391.732797] Call Trace:
> [ 2391.735240] <TASK>
> [ 2391.737339] ? show_regs (arch/x86/kernel/dumpstack.c:479)
> [ 2391.740744] ? die_addr (arch/x86/kernel/dumpstack.c:421
> arch/x86/kernel/dumpstack.c:460)
> [ 2391.744056] ? exc_general_protection (arch/x86/kernel/traps.c:697
> arch/x86/kernel/traps.c:642)
> [ 2391.748766] ? asm_exc_general_protection
> (arch/x86/include/asm/idtentry.h:564)
> [ 2391.753652] ? __mutex_lock (kernel/locking/mutex.c:613
> (discriminator 3) kernel/locking/mutex.c:747 (discriminator 3))
> [ 2391.757487] ? __mutex_lock (kernel/locking/mutex.c:613
> (discriminator 3) kernel/locking/mutex.c:747 (discriminator 3))
> [ 2391.761318] ? tracing_update_buffers (kernel/trace/trace.c:6470)
> [ 2391.765851] event_enable_write (kernel/trace/trace_events.c:1408)
> [ 2391.769976] vfs_write (fs/read_write.c:582)
> [ 2391.773296] ? close_fd_get_file (fs/file.c:821)
> [ 2391.777396] ? preempt_count_sub (kernel/sched/core.c:5857
> kernel/sched/core.c:5853 kernel/sched/core.c:5875)
> [ 2391.781496] ksys_write (fs/read_write.c:638)
> [ 2391.784918] __x64_sys_write (fs/read_write.c:646)
> [ 2391.788671] do_syscall_64 (arch/x86/entry/common.c:51
> arch/x86/entry/common.c:82)
> [ 2391.792248] ? do_syscall_64 (arch/x86/entry/common.c:101)
> [ 2391.795995] ? syscall_exit_to_user_mode (kernel/entry/common.c:299)
> [ 2391.800785] ? do_syscall_64 (arch/x86/entry/common.c:101)
> [ 2391.804529] ? do_syscall_64 (arch/x86/entry/common.c:101)
> [ 2391.808275] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
> [ 2391.813327] RIP: 0033:0x7fb4b977c140
> [ 2391.816920] Code: 40 00 48 8b 15 c1 9c 0d 00 f7 d8 64 89 02 48 c7
> c0 ff ff ff ff eb b7 0f 1f 00 80 3d a1 24 0e 00 00 74 17 b8 01 00 00
> 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28
> 48 89
> All code
> ========
> 0: 40 00 48 8b add %cl,-0x75(%rax)
> 4: 15 c1 9c 0d 00 adc $0xd9cc1,%eax
> 9: f7 d8 neg %eax
> b: 64 89 02 mov %eax,%fs:(%rdx)
> e: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
> 15: eb b7 jmp 0xffffffffffffffce
> 17: 0f 1f 00 nopl (%rax)
> 1a: 80 3d a1 24 0e 00 00 cmpb $0x0,0xe24a1(%rip) # 0xe24c2
> 21: 74 17 je 0x3a
> 23: b8 01 00 00 00 mov $0x1,%eax
> 28: 0f 05 syscall
> 2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <--
> trapping instruction
> 30: 77 58 ja 0x8a
> 32: c3 retq
> 33: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
> 3a: 48 83 ec 28 sub $0x28,%rsp
> 3e: 48 rex.W
> 3f: 89 .byte 0x89
>
> Code starting with the faulting instruction
> ===========================================
> 0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
> 6: 77 58 ja 0x60
> 8: c3 retq
> 9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
> 10: 48 83 ec 28 sub $0x28,%rsp
> 14: 48 rex.W
> 15: 89 .byte 0x89
> [ 2391.835660] RSP: 002b:00007ffc43b05b38 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000001
> [ 2391.843225] RAX: ffffffffffffffda RBX: 00007ffc43b05d88 RCX: 00007fb4b977c140
> [ 2391.850350] RDX: 0000000000000002 RSI: 000056376b59b7d4 RDI: 0000000000000007
> [ 2391.857482] RBP: 00007ffc43b05b60 R08: 0000000000000000 R09: 00007fb4b9681740
> [ 2391.864615] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> [ 2391.871747] R13: 00007ffc43b05d98 R14: 000056376b59ddc8 R15: 00007fb4b9981020
> [ 2391.878907] </TASK>
> [ 2391.881106] Modules linked in: x86_pkg_temp_thermal fuse configfs
> [ 2391.887288] ---[ end trace 0000000000000000 ]---
> [ 2391.891915] RIP: 0010:tracing_update_buffers (kernel/trace/trace.c:6470)
> [ 2391.897231] Code: 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00
> 55 31 f6 48 89 e5 41 55 41 54 53 48 89 fb 48 c7 c7 40 8c 61 94 e8 92
> d3 5a 01 <44> 0f b6 a3 18 1f 00 00 41 80 fc 01 0f 87 c8 dc 4e 01 45 31
> ed 41
> All code
> ========
> 0: 90 nop
> 1: 90 nop
> 2: 90 nop
> 3: 90 nop
> 4: 90 nop
> 5: 90 nop
> 6: 90 nop
> 7: 90 nop
> 8: 90 nop
> 9: 90 nop
> a: 90 nop
> b: 90 nop
> c: 66 0f 1f 00 nopw (%rax)
> 10: 55 push %rbp
> 11: 31 f6 xor %esi,%esi
> 13: 48 89 e5 mov %rsp,%rbp
> 16: 41 55 push %r13
> 18: 41 54 push %r12
> 1a: 53 push %rbx
> 1b: 48 89 fb mov %rdi,%rbx
> 1e: 48 c7 c7 40 8c 61 94 mov $0xffffffff94618c40,%rdi
> 25: e8 92 d3 5a 01 callq 0x15ad3bc
> 2a:* 44 0f b6 a3 18 1f 00 movzbl 0x1f18(%rbx),%r12d <-- trapping instruction
> 31: 00
> 32: 41 80 fc 01 cmp $0x1,%r12b
> 36: 0f 87 c8 dc 4e 01 ja 0x14edd04
> 3c: 45 31 ed xor %r13d,%r13d
> 3f: 41 rex.B
>
> Code starting with the faulting instruction
> ===========================================
> 0: 44 0f b6 a3 18 1f 00 movzbl 0x1f18(%rbx),%r12d
> 7: 00
> 8: 41 80 fc 01 cmp $0x1,%r12b
> c: 0f 87 c8 dc 4e 01 ja 0x14edcda
> 12: 45 31 ed xor %r13d,%r13d
> 15: 41 rex.B
> [ 2391.916120] RSP: 0018:ffffb36e0a477d80 EFLAGS: 00010246
> [ 2391.921569] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000080000000
> [ 2391.928872] RDX: 0000000000000000 RSI: ffffffff9299b722 RDI: ffffffff9299b722
> [ 2391.936237] RBP: ffffb36e0a477d98 R08: 000000000000002f R09: 0000000000000002
> [ 2391.943388] R10: ffffb36e0a477d70 R11: 0000000000000000 R12: 0000000000000002
> [ 2391.950527] R13: ffffb36e0a477e88 R14: ffff99c5803a2230 R15: ffff99c581c39000
> [ 2391.957670] FS: 00007fb4b9681740(0000) GS:ffff99c6efa80000(0000)
> knlGS:0000000000000000
> [ 2391.965822] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2391.971579] CR2: 00007fb4b96ab5e0 CR3: 000000010635c002 CR4: 00000000003706f0
> [ 2391.978721] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2391.985879] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2391.993028] Kernel panic - not syncing: Fatal exception
> [ 2391.998287] Kernel Offset: 0x10000000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 2392.009066] ---[ end Kernel panic - not syncing: Fatal exception ]---
>

It's hard to repro this, it does repro after running several times for
me.

Steven, this is firing when the self test writes 1 to the enable file
for a user_event created event.

Sometimes I get this splat instead of a panic:
[ 956.819778] ------------[ cut here ]------------
[ 956.820526] WARNING: CPU: 12 PID: 914 at kernel/trace/trace.c:516 tracing_release_file_tr+0x46/0x50
[ 956.821389] Modules linked in:
[ 956.821708] CPU: 12 PID: 914 Comm: ftrace_test Not tainted 6.6.0-rc7-next-20231026 #1
[ 956.822501] RIP: 0010:tracing_release_file_tr+0x46/0x50
[ 956.822902] Code: d1 03 01 8b 83 c0 1e 00 00 85 c0 74 1d 83 e8 01 48 c7 c7 80 5b ef 84 89 83 c0 1e 00 00 e8 b2 b5 03 01 31 c0 5b e9 75 ee 27 01 <0f> 0b eb df 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90
[ 956.824261] RSP: 0018:ffffb59d879e3ee8 EFLAGS: 00010246
[ 956.824656] RAX: 0000000000000000 RBX: ffff9e386a82c058 RCX: 0000000000000000
[ 956.825239] RDX: 0000000000000000 RSI: ffffffff84b6b383 RDI: 00000000ffffffff
[ 956.825835] RBP: ffff9e3844076200 R08: 0000000000000000 R09: 0000000000000000
[ 956.826425] R10: ffffb59d879e3ee8 R11: ffffffff8328526f R12: 00000000000f801f
[ 956.827359] R13: ffff9e3846134020 R14: ffff9e3843fd03a8 R15: 0000000000000000
[ 956.828732] FS: 00007eff4f818740(0000) GS:ffff9e3f26300000(0000) knlGS:0000000000000000
[ 956.829436] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 956.829968] CR2: 00007eff4f42a250 CR3: 00000007d0744004 CR4: 0000000000370eb0
[ 956.830591] Call Trace:
[ 956.830804] <TASK>
[ 956.831113] ? __warn+0x7f/0x160
[ 956.831469] ? tracing_release_file_tr+0x46/0x50
[ 956.831891] ? report_bug+0x1c3/0x1d0
[ 956.832224] ? handle_bug+0x3c/0x70
[ 956.832544] ? exc_invalid_op+0x14/0x70
[ 956.832861] ? asm_exc_invalid_op+0x16/0x20
[ 956.833190] ? tracing_release_file_tr+0x1f/0x50
[ 956.833727] ? tracing_release_file_tr+0x46/0x50
[ 956.834159] ? tracing_release_file_tr+0x1f/0x50
[ 956.834579] __fput+0xab/0x300
[ 956.834903] __x64_sys_close+0x38/0x80
[ 956.835235] do_syscall_64+0x41/0xf0
[ 956.835554] entry_SYSCALL_64_after_hwframe+0x6c/0x74
[ 956.835974] RIP: 0033:0x7eff4f515157
[ 956.836291] Code: ff e8 0d 16 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 83 b8 f7 ff
[ 956.837728] RSP: 002b:00007ffc021d8158 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 956.838351] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff4f515157
[ 956.838975] RDX: 0000000000000002 RSI: 000055637921d7d4 RDI: 0000000000000005
[ 956.839586] RBP: 00007ffc021d8180 R08: 0000000000000000 R09: 00007eff4f818740
[ 956.840201] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc021d83b8
[ 956.840810] R13: 000055637921cae7 R14: 000055637921fc38 R15: 00007eff4f946040
[ 956.841562] </TASK>
[ 956.841787] irq event stamp: 881
[ 956.842145] hardirqs last enabled at (889): [<ffffffff831d7f49>] console_unlock+0x109/0x130
[ 956.843013] hardirqs last disabled at (896): [<ffffffff831d7f2e>] console_unlock+0xee/0x130
[ 956.843648] softirqs last enabled at (694): [<ffffffff8314ba2a>] irq_exit_rcu+0x8a/0xe0
[ 956.844276] softirqs last disabled at (689): [<ffffffff8314ba2a>] irq_exit_rcu+0x8a/0xe0
[ 956.844893] ---[ end trace 0000000000000000 ]---

Warning is from this code:
static void __trace_array_put(struct trace_array *this_tr)
{
WARN_ON(!this_tr->ref);
this_tr->ref--;
}

It seems like there might be a timing window or an incorrect call to
trace_array_put() somewhere. Do you think this is related to the eventfs
work?

Thanks,
-Beau

> Links:
> - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20231026/testrun/20823454/suite/log-parser-test/tests/
> - https://lkft.validation.linaro.org/scheduler/job/6974179#L5053
>
> metadata:
> git_ref: master
> git_repo: https://gitlab.com/Linaro/lkft/mirrors/next/linux-next
> git_sha: 2ef7141596eed0b4b45ef18b3626f428a6b0a822
> git_describe: next-20231026
> kernel_version: 6.6.0-rc7
> kernel-config:
> https://storage.tuxsuite.com/public/linaro/lkft/builds/2XHt24sNSdog7DYY3FLKFZpZmjG/config
> artifact-location:
> https://storage.tuxsuite.com/public/linaro/lkft/builds/2XHt24sNSdog7DYY3FLKFZpZmjG/
> toolchain: gcc-13
>
>
> --
> Linaro LKFT
> https://lkft.linaro.org