Re: NULL pointer dereference in cpufreq_update_limits(?) under Xen PV dom0 - regression in 6.13
From: Jan Beulich
Date: Thu Mar 27 2025 - 06:14:54 EST
On 27.03.2025 01:51, Marek Marczykowski-Górecki wrote:
> Hi,
>
> I've got a report[1] that 6.13.6 crashes as listed below. It worked fine in
> 6.12.11. We've tried a few simple things to narrow the problem down, but
> without much success.
>
> This is running in Xen 4.17.5, PV dom0, which probably is relevant here.
> This is running on AMD Ryzen 9 7950X3D, with ASRock X670E Taichi
> motherboard.
> There are a few more details in the original report (link below).
>
> The kernel package (including its config saved into /boot) is here:
> https://yum.qubes-os.org/r4.2/current/host/fc37/rpm/kernel-latest-6.13.6-1.qubes.fc37.x86_64.rpm
> https://yum.qubes-os.org/r4.2/current/host/fc37/rpm/kernel-latest-modules-6.13.6-1.qubes.fc37.x86_64.rpm
>
> The crash message:
> [ 9.367048] BUG: kernel NULL pointer dereference, address: 0000000000000070
> [ 9.368251] #PF: supervisor read access in kernel mode
> [ 9.369273] #PF: error_code(0x0000) - not-present page
> [ 9.370346] PGD 0 P4D 0
> [ 9.371222] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 9.372114] CPU: 0 UID: 0 PID: 128 Comm: kworker/0:2 Not tainted 6.13.6-1.qubes.fc37.x86_64 #1
> [ 9.373184] Hardware name: ASRock X670E Taichi/X670E Taichi, BIOS 3.20 02/21/2025
> [ 9.374183] Workqueue: kacpi_notify acpi_os_execute_deferred
> [ 9.375124] RIP: e030:cpufreq_update_limits+0x10/0x30
> [ 9.375840] Code: 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 05 98 e4 21 02 <48> 8b 40 70 48 85 c0 74 06 e9 a2 36 38 00 cc e9 ec fe ff ff 66 66
> [ 9.377009] RSP: e02b:ffffc9004058be28 EFLAGS: 00010246
> [ 9.377667] RAX: 0000000000000000 RBX: ffff888005bf4800 RCX: ffff88805d635fa8
> [ 9.378415] RDX: ffff888005bf4800 RSI: 0000000000000085 RDI: 0000000000000000
> [ 9.379127] RBP: ffff888005cd7800 R08: 0000000000000000 R09: 8080808080808080
> [ 9.379887] R10: ffff88800391abc0 R11: fefefefefefefeff R12: ffff888004e8aa00
> [ 9.380669] R13: ffff88805d635f80 R14: ffff888004e8aa15 R15: ffff8880059baf00
> [ 9.381514] FS: 0000000000000000(0000) GS:ffff88805d600000(0000) knlGS:0000000000000000
> [ 9.382345] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9.383045] CR2: 0000000000000070 CR3: 000000000202c000 CR4: 0000000000050660
> [ 9.383786] Call Trace:
> [ 9.384335] <TASK>
> [ 9.384886] ? __die+0x23/0x70
> [ 9.385456] ? page_fault_oops+0x95/0x190
> [ 9.386036] ? exc_page_fault+0x76/0x190
> [ 9.386636] ? asm_exc_page_fault+0x26/0x30
> [ 9.387215] ? cpufreq_update_limits+0x10/0x30
> [ 9.387805] acpi_processor_notify.part.0+0x79/0x150
> [ 9.388402] acpi_ev_notify_dispatch+0x4b/0x80
> [ 9.389013] acpi_os_execute_deferred+0x1a/0x30
> [ 9.389610] process_one_work+0x186/0x3b0
> [ 9.390205] worker_thread+0x251/0x360
> [ 9.390765] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 9.391376] ? __pfx_worker_thread+0x10/0x10
> [ 9.391957] kthread+0xd2/0x100
> [ 9.392493] ? __pfx_kthread+0x10/0x10
> [ 9.393043] ret_from_fork+0x34/0x50
> [ 9.393575] ? __pfx_kthread+0x10/0x10
> [ 9.394090] ret_from_fork_asm+0x1a/0x30
> [ 9.394621] </TASK>
> [ 9.395106] Modules linked in: gpio_generic amd_3d_vcache acpi_pad(-) loop fuse xenfs dm_thin_pool dm_persistent_data dm_bio_prison amdgpu amdxcp i2c_algo_bit drm_ttm_helper ttm crct10dif_pclmul drm_exec crc32_pclmul gpu_sched
> crc32c_intel drm_suballoc_helper polyval_clmulni drm_panel_backlight_quirks polyval_generic drm_buddy ghash_clmulni_intel sha512_ssse3 drm_display_helper sha256_ssse3 sha1_ssse3 xhci_pci cec nvme sp5100_tco xhci_hcd nvme_core nvme_auth
> video wmi xen_acpi_processor xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
> [ 9.398698] CR2: 0000000000000070
> [ 9.399266] ---[ end trace 0000000000000000 ]---
> [ 9.399880] RIP: e030:cpufreq_update_limits+0x10/0x30
> [ 9.400528] Code: 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 05 98 e4 21 02 <48> 8b 40 70 48 85 c0 74 06 e9 a2 36 38 00 cc e9 ec fe ff ff 66 66
> [ 9.401673] RSP: e02b:ffffc9004058be28 EFLAGS: 00010246
> [ 9.402316] RAX: 0000000000000000 RBX: ffff888005bf4800 RCX: ffff88805d635fa8
> [ 9.403060] RDX: ffff888005bf4800 RSI: 0000000000000085 RDI: 0000000000000000
> [ 9.403819] RBP: ffff888005cd7800 R08: 0000000000000000 R09: 8080808080808080
> [ 9.404581] R10: ffff88800391abc0 R11: fefefefefefefeff R12: ffff888004e8aa00
> [ 9.405332] R13: ffff88805d635f80 R14: ffff888004e8aa15 R15: ffff8880059baf00
> [ 9.406063] FS: 0000000000000000(0000) GS:ffff88805d600000(0000) knlGS:0000000000000000
> [ 9.406830] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9.407561] CR2: 0000000000000070 CR3: 000000000202c000 CR4: 0000000000050660
> [ 9.408318] Kernel panic - not syncing: Fatal exception
> [ 9.409022] Kernel Offset: disabled
> (XEN) Hardware Dom0 crashed: 'noreboot' set - not rebooting.
>
> Looking at the call trace, it's likely related to ACPI, and Xen too, so
> I'm adding relevant lists too.
>
> Any ideas?
>
> #regzbot introduced: v6.12.11..v6.13.6

That code looks to have been introduced in 6.9, so I wonder whether until now you
were merely lucky not to have observed any "highest perf changed" notification. See
9c4a13a08a9b ("ACPI: cpufreq: Add highest perf change notification"), which IMO
merely adds a second path to a pre-existing problem: cpufreq_update_limits() assumes
that cpufreq_driver is non-NULL, and only checks cpufreq_driver->update_limits.
But of course the assumption there may be legitimate, in which case it's logic
elsewhere which is or has become flawed.

Jan
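
The register dump in the quoted oops is consistent with the reading above. The
bytes just before the faulting instruction, 48 8b 05 98 e4 21 02, decode to a
RIP-relative load into %rax, and the faulting instruction itself, 48 8b 40 70,
is mov 0x70(%rax),%rax. With RAX = 0 that read hits address 0x70, which matches
CR2. So the pointer that was loaded (presumably cpufreq_driver) was NULL, and
0x70 would be the offset of the member being checked (presumably update_limits).

Below is a minimal userspace sketch of the pattern and of one possible guard.
It is not the kernel code and not a proposed patch: the names
cpufreq_driver_model, model_update_policy(), model_driver_hook() and
update_limits_guarded() are hypothetical and exist only for illustration, and
whether a NULL check in cpufreq_update_limits() is the right fix (as opposed to
fixing the caller) is exactly the open question raised above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct cpufreq_driver. */
struct cpufreq_driver_model {
	void (*update_limits)(unsigned int cpu);
};

/* Stays NULL when no cpufreq driver has registered (assumed to be the
 * situation in the reported Xen PV dom0 crash). */
static struct cpufreq_driver_model *cpufreq_driver_model;

static unsigned int driver_hook_calls;
static unsigned int policy_fallback_calls;

/* Stand-in for the generic fallback path. */
static void model_update_policy(unsigned int cpu)
{
	(void)cpu;
	policy_fallback_calls++;
}

/* Stand-in for a driver-provided update_limits callback. */
static void model_driver_hook(unsigned int cpu)
{
	(void)cpu;
	driver_hook_calls++;
}

/*
 * Guarded variant: check the driver pointer itself before looking at its
 * update_limits member. Returns -1 when no driver is registered, 0 otherwise.
 */
static int update_limits_guarded(unsigned int cpu)
{
	struct cpufreq_driver_model *drv = cpufreq_driver_model;

	if (!drv)
		return -1;	/* unguarded code would fault on drv->update_limits */

	if (drv->update_limits)
		drv->update_limits(cpu);
	else
		model_update_policy(cpu);

	return 0;
}

/* Small self-check exercising all three cases. */
static void demo(void)
{
	struct cpufreq_driver_model with_hook = { .update_limits = model_driver_hook };
	struct cpufreq_driver_model no_hook = { .update_limits = NULL };

	cpufreq_driver_model = NULL;
	assert(update_limits_guarded(0) == -1);

	cpufreq_driver_model = &with_hook;
	assert(update_limits_guarded(0) == 0);

	cpufreq_driver_model = &no_hook;
	assert(update_limits_guarded(0) == 0);
}
```

Calling demo() from a main() exercises the three cases: no driver registered,
a driver with an update_limits callback, and a driver without one. Only the
first case corresponds to the crash in the report.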