Re: Panic related to perf (bisected)

From: Tibor Billes
Date: Fri May 22 2020 - 01:15:09 EST


Hi,

On Mon, 18 May 2020, Ian Rogers wrote:

> On Sat, May 16, 2020 at 6:36 AM Billes Tibor <tbilles@xxxxxxx> wrote:
> >
> > Hi,
> >
> > I've been hitting a freeze on my laptop since 5.3, but haven't got the
> > time to finish bisecting it. Now
> > I had, and here is what I found:
> >
> > - 5.2 series works correctly (tested 5.2.9 and 5.2.15)
> > - 5.3 series and newer kernels freeze. The newest I tested is 5.6.10
> > (which also freezes). There will
> > be a complete bisect log at the end of the mail.
> >
> > There are several circumstances to reproduce the freeze. At least this
> > is what I found relevant:
> > - The freeze does not occur after a fresh boot, it needs a sleep-wakeup
> > cycle.
> > - Run `perf stat -a --topdown -- sleep 8s` every minute (It is part of
> > some metrics I collect using Zabbix)
> > - Some workload (building the kernel or gaming) increases the chance of
> > freezing, but it can occur
> > without user interaction too.
> >
> > The freeze usually comes within a few minutes after wakeup. The longest
> > was about an hour. (For
> > comparison, if I don't do a sleep-wakeup, the machine works fine for 8+
> > hours).
> >
> > First, there is a warning in syslog, then I took pictures of the actual
> > panic.
> >
> > The warning:
> > May 16 13:28:46 serpens kernel: [33128.086217] ------------[ cut here
> > ]------------
> > May 16 13:28:46 serpens kernel: [33128.086222] WARNING: CPU: 0 PID: 0 at
> > arch/x86/events/core.c:1506 x86_pmu_del+0x140/0x160
> > May 16 13:28:46 serpens kernel: [33128.086223] Modules linked in:
> > nouveau mxm_wmi ttm ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc
> > iptable_filter binfmt_misc essiv authenc uvcvideo iwlmvm
> > videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videodev mac80211
> > videobuf2_common snd_hda_codec_realtek snd_hda_codec_generic libarc4
> > snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep intel_rapl_msr
> > iwlwifi snd_hda_core intel_rapl_common snd_pcm x86_pkg_temp_thermal
> > snd_seq mei_me cfg80211 snd_seq_device snd_timer mei intel_powerclamp
> > snd soundcore ideapad_laptop coretemp sparse_keymap ip_tables x_tables
> > dm_crypt hid_generic usbhid hid i915 intel_gtt i2c_algo_bit
> > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
> > aesni_intel glue_helper crypto_simd r8169 ahci cryptd psmouse libahci
> > realtek drm_panel_orientation_quirks video wmi
> > May 16 13:28:46 serpens kernel: [33128.086247] CPU: 0 PID: 0 Comm:
> > swapper/0 Not tainted 5.6.10 #52
> > May 16 13:28:46 serpens kernel: [33128.086247] Hardware name: LENOVO
> > 20378/Lenovo Y50-70, BIOS 9ECN36WW(V2.00) 01/12/2015
> > May 16 13:28:46 serpens kernel: [33128.086249] RIP:
> > 0010:x86_pmu_del+0x140/0x160
> > May 16 13:28:46 serpens kernel: [33128.086250] Code: 63 d8 48 c7 84 dd
> > 20 07 00 00 00 00 00 00 4c 89 e7 89 85 14 02 00 00 e8 fe 27 16 00 e9 ef
> > fe ff ff 44 8d 6b 01 e9 5d ff ff ff <0f> 0b 5b 5d 41 5c 41 5d c3 31 db
> > e9 41 ff ff ff 41 bd 01 00 00 00
> > May 16 13:28:46 serpens kernel: [33128.086251] RSP:
> > 0018:ffffa2a380003e40 EFLAGS: 00010046
> > May 16 13:28:46 serpens kernel: [33128.086252] RAX: 0000000000000005
> > RBX: 0000000000000005 RCX: 0000000000000010
> > May 16 13:28:46 serpens kernel: [33128.086253] RDX: 0000000000000005
> > RSI: 0000000000000005 RDI: ffff907c4d4b1000
> > May 16 13:28:46 serpens kernel: [33128.086254] RBP: ffff907d932125a0
> > R08: 0000000000000002 R09: 0000000000029f80
> > May 16 13:28:46 serpens kernel: [33128.086254] R10: ffffa2a380003eb0
> > R11: 0000000000000000 R12: ffff907c4d4b1000
> > May 16 13:28:46 serpens kernel: [33128.086255] R13: 0000000000000006
> > R14: ffff907d9326fc0c R15: ffff907d9326fb00
> > May 16 13:28:46 serpens kernel: [33128.086256] FS:
> > 0000000000000000(0000) GS:ffff907d93200000(0000) knlGS:0000000000000000
> > May 16 13:28:46 serpens kernel: [33128.086256] CS: 0010 DS: 0000 ES:
> > 0000 CR0: 0000000080050033
> > May 16 13:28:46 serpens kernel: [33128.086257] CR2: 00007f713c5b44b0
> > CR3: 000000008300a001 CR4: 00000000001606f0
> > May 16 13:28:46 serpens kernel: [33128.086257] Call Trace:
> > May 16 13:28:46 serpens kernel: [33128.086259] <IRQ>
> > May 16 13:28:46 serpens kernel: [33128.086263]
> > event_sched_out.isra.116+0x89/0x1f0
> > May 16 13:28:46 serpens kernel: [33128.086264]
> > group_sched_out.part.118+0x55/0xd0
> > May 16 13:28:46 serpens kernel: [33128.086265] ctx_sched_out+0x207/0x240
> > May 16 13:28:46 serpens kernel: [33128.086267]
> > perf_mux_hrtimer_handler+0x267/0x310
> > May 16 13:28:46 serpens kernel: [33128.086269] ?
> > __perf_install_in_context+0x220/0x220
> > May 16 13:28:46 serpens kernel: [33128.086270]
> > __hrtimer_run_queues+0xfa/0x260
> > May 16 13:28:46 serpens kernel: [33128.086272] hrtimer_interrupt+0xe5/0x240
> > May 16 13:28:46 serpens kernel: [33128.086275] ?
> > recalibrate_cpu_khz+0x10/0x10
> > May 16 13:28:46 serpens kernel: [33128.086278]
> > smp_apic_timer_interrupt+0x62/0x120
> > May 16 13:28:46 serpens kernel: [33128.086280] apic_timer_interrupt+0xf/0x20
> > May 16 13:28:46 serpens kernel: [33128.086281] </IRQ>
> > May 16 13:28:46 serpens kernel: [33128.086284] RIP:
> > 0010:cpuidle_enter_state+0xb6/0x3f0
> > May 16 13:28:46 serpens kernel: [33128.086285] Code: c4 0f 1f 44 00 00
> > 31 ff e8 d7 bf a9 ff 80 7c 24 0b 00 74 12 9c 58 f6 c4 02 0f 85 12 03 00
> > 00 31 ff e8 0e 74 af ff fb 45 85 ed <0f> 88 ad 02 00 00 49 63 c5 4c 2b
> > 24 24 48 8d 14 40 48 8d 14 90 48
> > May 16 13:28:46 serpens kernel: [33128.086285] RSP:
> > 0018:ffffffffb5003e58 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
> > May 16 13:28:46 serpens kernel: [33128.086286] RAX: ffff907d9322a700
> > RBX: ffff907d93232c00 RCX: 000000000000001f
> > May 16 13:28:46 serpens kernel: [33128.086287] RDX: 00001e213bcf5a93
> > RSI: 0000000033518eeb RDI: 0000000000000000
> > May 16 13:28:46 serpens kernel: [33128.086288] RBP: ffffffffb50be6a0
> > R08: 0000000000000002 R09: 0000000000029f80
> > May 16 13:28:46 serpens kernel: [33128.086288] R10: ffffffffb5003e40
> > R11: 000000000000004c R12: 00001e213bcf5a93
> > May 16 13:28:46 serpens kernel: [33128.086289] R13: 0000000000000001
> > R14: 0000000000000001 R15: ffffffffffffffff
> > May 16 13:28:46 serpens kernel: [33128.086291] ?
> > cpuidle_enter_state+0x99/0x3f0
> > May 16 13:28:46 serpens kernel: [33128.086292] cpuidle_enter+0x29/0x40
> > May 16 13:28:46 serpens kernel: [33128.086295] do_idle+0x1ff/0x270
> > May 16 13:28:46 serpens kernel: [33128.086296] cpu_startup_entry+0x19/0x20
> > May 16 13:28:46 serpens kernel: [33128.086299] start_kernel+0x4a6/0x4b3
> > May 16 13:28:46 serpens kernel: [33128.086302]
> > secondary_startup_64+0xa4/0xb0
> > May 16 13:28:46 serpens kernel: [33128.086303] ---[ end trace
> > 6d25af1bc8d0e9ca ]---
> >
> > Here are the photos of the panic: https://postimg.cc/gallery/Ywm0Ywd
> > (Same panic, just taken
> > multiple photos to be readable).
> >
> > Here is another photo of a panic from another machine:
> > https://i.postimg.cc/d00s8Ncb/DSC-0395.jpg
> > Same setup regarding perf stat -a and the sleep-wakeup, but can't
> > remember the exact kernel
> > version I was running.
> >
> > I tried reverting fd7d55172d1e2e501e6da0a5c1de25f06612dc2e which is the
> > first bad commit
> > according to my bisect, but there were conflicts, and I'm not a kernel
> > developer, I couldn't resolve
> > them.
>
> The bisect shows bad and good alongside each other and so isn't
> categorically pointing at that change, which should be benign with
> your command lines given cgroup profiling isn't used.

I was under the impression this is how bisect works, good and bad
revisions progress as the binary search tries to pick commits in the
middle and sometimes it is before, and sometime it is after the commit in
question. Given the above I don't understand what you mean by 'so isn't
categorically pointing at that change'. Could you please explain? Is
there a way for me to do a more precise bisect? Are there any related
DEBUG configs to enable extra asserts?

> There were
> related fixes, such as 7fa343b7fdc4f351de4e3f28d5c285937dd1f42f and
> 90c91dfb86d0ff545bd329d3ddd72c147e2ae198 but neither mention panics.
> The crash is happening in the PMU code rather than the generic perf
> handling, so maybe that is at issue?

Thanks for clarifying some details!

Tibor

>
> Sorry to not have a fix for your problem,
> Ian
>
> > Bisect log:
> >
> > git bisect start
> > # bad: [70c707aae3e586f6ebc677ad733a9c3cea32f7ac] Linux 5.5.1
> > git bisect bad 70c707aae3e586f6ebc677ad733a9c3cea32f7ac
> > # good: [aad39e30fb9e6e7212318a1dad898f36f1075648] Linux 5.2.9
> > git bisect good aad39e30fb9e6e7212318a1dad898f36f1075648
> > # good: [0ecfebd2b52404ae0c54a878c872bb93363ada36] Linux 5.2
> > git bisect good 0ecfebd2b52404ae0c54a878c872bb93363ada36
> > # bad: [3c30819dc68ab9498216421b6846123900c0a6d3] Merge
> > git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
> > git bisect bad 3c30819dc68ab9498216421b6846123900c0a6d3
> > # bad: [d9b9c893048e9d308a833619f0866f1f52778cf5] Merge tag
> > 'ceph-for-5.3-rc1' of git://github.com/ceph/ceph-client
> > git bisect bad d9b9c893048e9d308a833619f0866f1f52778cf5
> > # bad: [db0457338ece7482378d88e50ad298191c3e6947] Merge branch
> > 'for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
> > git bisect bad db0457338ece7482378d88e50ad298191c3e6947
> > # bad: [608745f12462e2d8d94d5cc5dc94bf0960a881e3] Merge branch
> > 'perf-core-for-linus' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> > git bisect bad 608745f12462e2d8d94d5cc5dc94bf0960a881e3
> > # good: [947fbd4ca9fb38f320b076e68cfccab977c5ea01] Merge tag
> > 'please-pull-for_5.3' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
> > git bisect good 947fbd4ca9fb38f320b076e68cfccab977c5ea01
> > # good: [0dcb4efb1095d0a1f5f681c2b94e98b009cc5d77] ALSA: firewire-lib:
> > code refactoring for local variables
> > git bisect good 0dcb4efb1095d0a1f5f681c2b94e98b009cc5d77
> > # good: [cf2d213e49fdf47e4c10dc629a3659e0026a54b8] Merge tag
> > 'pm-5.3-rc1' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
> > git bisect good cf2d213e49fdf47e4c10dc629a3659e0026a54b8
> > # bad: [fe96245c7f38c4ea92c1c599b43f176e27d9921e] perf hists: Add
> > block_info in hist_entry
> > git bisect bad fe96245c7f38c4ea92c1c599b43f176e27d9921e
> > # good: [87407fa58b6645cecd24102f58476b8dd7ce778d] perf config: Update
> > default value for llvm.clang-bpf-cmd-template
> > git bisect good 87407fa58b6645cecd24102f58476b8dd7ce778d
> > # skip: [78d6ccce03e86de34c7000bcada493ed0679e350] perf build: Handle
> > slang being in /usr/include and in /usr/include/slang/
> > git bisect skip 78d6ccce03e86de34c7000bcada493ed0679e350
> > # good: [b9271f0c65e12efd9b6319f57a378aac1be98b22] Merge tag 'v5.2-rc6'
> > into perf/core, to refresh branch
> > git bisect good b9271f0c65e12efd9b6319f57a378aac1be98b22
> > # good: [203dffacf592317e54480704f569a09f8b7ca380] Documentation/ABI:
> > Document umwait control sysfs interfaces
> > git bisect good 203dffacf592317e54480704f569a09f8b7ca380
> > # bad: [cf8b6970f4fc31898f3d9e25159aa57e235ca4d1] perf symbols: We need
> > util.h in symbol-elf.c for zfree()
> > git bisect bad cf8b6970f4fc31898f3d9e25159aa57e235ca4d1
> > # bad: [9f94c7f947e919c343b30f080285af53d0fa9902] perf tools: Increase
> > MAX_NR_CPUS and MAX_CACHES
> > git bisect bad 9f94c7f947e919c343b30f080285af53d0fa9902
> > # good: [122f1c51b11a9e572263c4965d772381fcef06c5] perf/x86/rapl: Get
> > MSR values from new probe framework
> > git bisect good 122f1c51b11a9e572263c4965d772381fcef06c5
> > # bad: [2d7102a0453769fd37e9f4ce68652e104fbf5c84] perf tools: Fix
> > cache.h include directive
> > git bisect bad 2d7102a0453769fd37e9f4ce68652e104fbf5c84
> > # good: [637d97b53cdded02da55d0a25cda6fd6af3bd042] perf/x86/rapl: Get
> > quirk state from new probe framework
> > git bisect good 637d97b53cdded02da55d0a25cda6fd6af3bd042
> > # bad: [fd7d55172d1e2e501e6da0a5c1de25f06612dc2e] perf/cgroups: Don't
> > rotate events for cgroups unnecessarily
> > git bisect bad fd7d55172d1e2e501e6da0a5c1de25f06612dc2e
> > # first bad commit: [fd7d55172d1e2e501e6da0a5c1de25f06612dc2e]
> > perf/cgroups: Don't rotate events for cgroups unnecessarily
> >
> > Is there any more information that I can gather to make this issue
> > easier to fix?
> >
> > Thanks,
> > Tibor Billes
>