amdgpu - BUG: kernel NULL pointer dereference, address: 0000000000000000

From: David C. Rankin
Date: Wed Jun 29 2022 - 03:06:07 EST


All,

There appears to be a bug (possibly a regression) in the amdgpu driver that results in a "Fatal error during GPU init". It first appeared with the 5.17 kernel and is still present in the current 5.18 kernel. The consequences of the NULL pointer dereference also seem to be getting worse: it now causes the machine to hang at the end of the shutdown procedure, which is tough for boxes that are administered remotely.

I have two servers with old AMD cards that have this exact problem. lspci -v (as user) reports the card as:

01:00.1 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] RV370 [Radeon X300 SE]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0f03
Flags: fast devsel, NUMA node 0
Memory at fea20000 (32-bit, non-prefetchable) [size=64K]
Capabilities: <access denied>
Kernel modules: amdgpu

The host is:

Host: valkyrie Kernel: 5.18.7-arch1-1 arch: x86_64 bits: 64 compiler: gcc
v: 12.1.0 parameters: BOOT_IMAGE=/vmlinuz-linux
root=UUID=515ef9dc-769f-4548-9a08-3a92fa83d86b rw iommu=soft
amd_iommu_dump= quiet audit=0
Console: pty pts/0 DM: LightDM v: 1.30.0 Distro: Arch Linux

Machine:
Type: Desktop Mobo: Gigabyte model: 990FXA-UD3 v: x.x serial: N/A
BIOS: American Megatrends v: F3 date: 05/28/2015

Memory:
RAM: total: 31.31 GiB used: 1012.9 MiB (3.2%)

CPU:
Info: model: AMD FX-8350 socket: AM3 bits: 64 type: MT MCP arch: Piledriver
built: 2012-13 process: GF 32nm family: 0x15 (21) model-id: 2 stepping: 0
microcode: 0x6000852

Graphics:
Device-1: AMD RV370 [Radeon X300] driver: radeon v: kernel
alternate: amdgpu arch: Rage 9 code: R360-R400 process: TSMC 110nm
built: 2003-08 pcie: gen: 1 speed: 2.5 GT/s lanes: 16 ports:
active: DVI-I-1 empty: SVIDEO-1 bus-ID: 01:00.0 chip-ID: 1002:5b60
class-ID: 0300

The NULL pointer dereference occurs during GPU init of the card. These cards are fanless and were specifically chosen for that. They are used in server installs and have been flawless for years. If it were just one card acting up, I could see it being a card problem, but I have two identical servers set up with this card and both show the exact same "BUG: kernel NULL pointer dereference":

[ 9.660937] [drm] amdgpu kernel modesetting enabled.
[ 9.661025] amdgpu: CRAT table not found
[ 9.661028] amdgpu: Virtual CRAT table created for CPU
[ 9.661040] amdgpu: Topology: Add CPU node
[ 9.661296] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x5B70 0x1002:0x0F03 0x00).
[ 9.661302] amdgpu 0000:01:00.1: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 9.661305] amdgpu 0000:01:00.1: amdgpu: Fatal error during GPU init
[ 9.661318] amdgpu: probe of 0000:01:00.1 failed with error -12
[ 9.661338] BUG: kernel NULL pointer dereference, address: 0000000000000000
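
The `-12` in the probe failure line is a negated errno value. As a minimal sketch, assuming the standard Linux errno numbering (the helper name `decode_probe_error` is made up for illustration), the value can be pulled out of the log line and named like this:

```python
import errno
import re

# Line copied from the dmesg excerpt above.
PROBE_LINE = "[ 9.661318] amdgpu: probe of 0000:01:00.1 failed with error -12"


def decode_probe_error(line):
    """Return (pci_address, errno_name) for a 'probe of ... failed' line, or None."""
    match = re.search(r"probe of (\S+) failed with error -(\d+)", line)
    if match is None:
        return None
    code = int(match.group(2))
    # errno.errorcode maps numeric errno values to their symbolic names.
    return match.group(1), errno.errorcode.get(code, "unknown")


print(decode_probe_error(PROBE_LINE))  # ('0000:01:00.1', 'ENOMEM')
```

So the probe reports ENOMEM before the oops. That names the error code only; it does not by itself say which allocation or lookup failed.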

Full dmesg output for this with backtrace is attached.

Bugs related to this problem are open with freedesktop and with Arch Linux.

https://gitlab.freedesktop.org/drm/amd/-/issues/2070

and

https://bugs.archlinux.org/task/74346#comment209209

Are those the proper locations for the bug report, or does a kernel bug also need to be opened to track the issue? Let me know, and if you need any further information from the machines, I'm happy to get it.

--
David C. Rankin, J.D.,P.E.

[ 9.660937] [drm] amdgpu kernel modesetting enabled.
[ 9.661025] amdgpu: CRAT table not found
[ 9.661028] amdgpu: Virtual CRAT table created for CPU
[ 9.661040] amdgpu: Topology: Add CPU node
[ 9.661296] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x5B70 0x1002:0x0F03 0x00).
[ 9.661302] amdgpu 0000:01:00.1: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 9.661305] amdgpu 0000:01:00.1: amdgpu: Fatal error during GPU init
[ 9.661318] amdgpu: probe of 0000:01:00.1 failed with error -12
[ 9.661338] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 9.661384] #PF: supervisor write access in kernel mode
[ 9.661411] #PF: error_code(0x0002) - not-present page
[ 9.661440] PGD 0 P4D 0
[ 9.661454] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 9.661479] CPU: 3 PID: 358 Comm: systemd-udevd Tainted: G OE 5.18.7-arch1-1 #1 b361f845a00a4369e3079c139378bcbc5b131d49
[ 9.661543] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./990FXA-UD3, BIOS F3 05/28/2015
[ 9.661595] RIP: 0010:amdgpu_device_fini_sw+0x2a4/0x390 [amdgpu]
[ 9.662020] Code: 82 00 00 00 48 89 df e8 da a2 04 00 48 83 bb e8 5f 00 00 00 74 08 48 89 df e8 68 42 04 00 48 8b bb f0 74 01 00 b8 ff ff ff ff <f0> 0f c1 07 83 f8 01 74 4c 85 c0 0f 8e c3 00 00 00 48 c7 83 f0 74
[ 9.662120] RSP: 0018:ffffba0ec0e17b28 EFLAGS: 00010246
[ 9.662149] RAX: 00000000ffffffff RBX: ffff9c31c3e00000 RCX: 0000000000000000
[ 9.662189] RDX: 00000000000305c0 RSI: 0000000000000000 RDI: 0000000000000000
[ 9.662226] RBP: ffff9c31c3e00010 R08: ffffba0ec0e17ba8 R09: ffff9c31c01cd9d0
[ 9.662267] R10: 000000000000002a R11: ffff9c31c5d3c6d8 R12: ffffba0ec0e17ba8
[ 9.662305] R13: ffff9c31c10560d0 R14: ffff9c31c1056374 R15: ffffba0ec0e17db0
[ 9.662344] FS: 00007febac4b6080(0000) GS:ffff9c38becc0000(0000) knlGS:0000000000000000
[ 9.662412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.662443] CR2: 0000000000000000 CR3: 0000000106b2a000 CR4: 00000000000406e0
[ 9.662482] Call Trace:
[ 9.662498] <TASK>
[ 9.662512] amdgpu_driver_release_kms+0x16/0x30 [amdgpu 4c56326c653da05dd809a4247720710306fcf0fb]
[ 9.662887] devm_drm_dev_init_release+0x43/0x60
[ 9.662915] release_nodes+0x38/0xb0
[ 9.662938] devres_release_all+0x8c/0xc0
[ 9.662962] device_unbind_cleanup+0xe/0x70
[ 9.662988] really_probe+0x143/0x370
[ 9.663012] __driver_probe_device+0xfc/0x170
[ 9.663036] driver_probe_device+0x1f/0x90
[ 9.663060] __driver_attach+0xbf/0x1a0
[ 9.663084] ? __device_attach_driver+0xe0/0xe0
[ 9.663113] bus_for_each_dev+0x87/0xd0
[ 9.663136] bus_add_driver+0x15d/0x200
[ 9.663158] driver_register+0x8d/0xe0
[ 9.663181] ? 0xffffffffc0baa000
[ 9.663201] do_one_initcall+0x5d/0x220
[ 9.663228] do_init_module+0x4a/0x240
[ 9.663251] __do_sys_init_module+0x138/0x1b0
[ 9.663280] do_syscall_64+0x5f/0x90
[ 9.663302] ? __vm_munmap+0x90/0x110
[ 9.663324] ? syscall_exit_to_user_mode+0x26/0x50
[ 9.663351] ? __x64_sys_munmap+0x1b/0x20
[ 9.663375] ? do_syscall_64+0x6b/0x90
[ 9.663400] ? syscall_exit_to_user_mode+0x26/0x50
[ 9.663428] ? do_syscall_64+0x6b/0x90
[ 9.663455] ? exc_page_fault+0x74/0x170
[ 9.663484] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 9.663518] RIP: 0033:0x7febabd1299e
[ 9.663543] Code: 48 8b 0d fd a3 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ca a3 0e 00 f7 d8 64 89 01 48
[ 9.663673] RSP: 002b:00007ffee72fa658 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[ 9.663747] RAX: ffffffffffffffda RBX: 00005632d080cc00 RCX: 00007febabd1299e
[ 9.663804] RDX: 00007febac4ca32c RSI: 00000000010e8ebe RDI: 00007feba8b83010
[ 9.663842] RBP: 00007feba8b83010 R08: 0000000000261000 R09: 85ebca77c2b2ae63
[ 9.663879] R10: 0000000000035721 R11: 0000000000000246 R12: 00007febac4ca32c
[ 9.663918] R13: 00005632d0808a10 R14: 00005632d080cc00 R15: 00005632d0812310
[ 9.663958] </TASK>
[ 9.663969] Modules linked in: ccp snd_hda_codec_realtek amdgpu(+) snd_hda_codec_generic rng_core ledtrig_audio snd_hda_intel snd_intel_dspcfg kvm snd_intel_sdw_acpi irqbypass snd_hda_codec snd_hda_core snd_hwdep crct10dif_pclmul snd_pcm crc32_pclmul ghash_clmulni_intel snd_timer snd aesni_intel mousedev r8169 mxm_wmi sp5100_tco soundcore radeon gpu_sched crypto_simd realtek fam15h_power pcspkr i2c_piix4 cryptd mdio_devres k10temp drm_ttm_helper libphy ttm wmi drm_dp_helper mac_hid acpi_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) dm_multipath dm_mod sg crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sr_mod cdrom serio_raw ata_generic atkbd raid1 firewire_ohci pata_acpi uas libps2 md_mod vivaldi_fmap crc32c_intel firewire_core usb_storage crc_itu_t pata_atiixp xhci_pci i8042 xhci_pci_renesas serio usbhid
[ 9.667850] CR2: 0000000000000000
[ 9.669612] ---[ end trace 0000000000000000 ]---
[ 9.671370] RIP: 0010:amdgpu_device_fini_sw+0x2a4/0x390 [amdgpu]
[ 9.673577] Code: 82 00 00 00 48 89 df e8 da a2 04 00 48 83 bb e8 5f 00 00 00 74 08 48 89 df e8 68 42 04 00 48 8b bb f0 74 01 00 b8 ff ff ff ff <f0> 0f c1 07 83 f8 01 74 4c 85 c0 0f 8e c3 00 00 00 48 c7 83 f0 74
[ 9.677184] RSP: 0018:ffffba0ec0e17b28 EFLAGS: 00010246
[ 9.678979] RAX: 00000000ffffffff RBX: ffff9c31c3e00000 RCX: 0000000000000000
[ 9.680774] RDX: 00000000000305c0 RSI: 0000000000000000 RDI: 0000000000000000
[ 9.682545] RBP: ffff9c31c3e00010 R08: ffffba0ec0e17ba8 R09: ffff9c31c01cd9d0
[ 9.684256] R10: 000000000000002a R11: ffff9c31c5d3c6d8 R12: ffffba0ec0e17ba8
[ 9.685980] R13: ffff9c31c10560d0 R14: ffff9c31c1056374 R15: ffffba0ec0e17db0
[ 9.687661] FS: 00007febac4b6080(0000) GS:ffff9c38becc0000(0000) knlGS:0000000000000000
[ 9.689394] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.691095] CR2: 0000000000000000 CR3: 0000000106b2a000 CR4: 00000000000406e0
[ 9.695698] SVM: TSC scaling supported
[ 9.697388] kvm: Nested Virtualization enabled
[ 9.699132] SVM: kvm: Nested Paging enabled
[ 9.700816] SVM: LBR virtualization supported
[ 9.763852] MCE: In-kernel MCE decoding enabled.
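
The `Code:` line in the oops above marks the first byte of the faulting instruction with angle brackets. As a minimal sketch (the helper name `split_code_line` is made up for illustration), the dump can be split at that marker like this:

```python
# Code: line copied from the oops above (kernel-mode RIP).
CODE_LINE = (
    "Code: 82 00 00 00 48 89 df e8 da a2 04 00 48 83 bb e8 5f 00 00 00 "
    "74 08 48 89 df e8 68 42 04 00 48 8b bb f0 74 01 00 b8 ff ff ff ff "
    "<f0> 0f c1 07 83 f8 01 74 4c 85 c0 0f 8e c3 00 00 00 48 c7 83 f0 74"
)


def split_code_line(line):
    """Split an oops 'Code:' line into (bytes before the fault, bytes from the fault on)."""
    tokens = line.split("Code:", 1)[1].split()
    # The byte wrapped in <..> is the first byte of the faulting instruction.
    fault_index = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    before = bytes(int(tok, 16) for tok in tokens[:fault_index])
    at_fault = bytes(int(tok.strip("<>"), 16) for tok in tokens[fault_index:])
    return before, at_fault


before, at_fault = split_code_line(CODE_LINE)
print(" ".join(f"{b:02x}" for b in at_fault[:4]))  # f0 0f c1 07
```

Read by hand, `f0 0f c1 07` looks like `lock xadd %eax,(%rdi)`. With RDI = 0 and RAX = 0xffffffff in the register dump, that would be an atomic decrement through a NULL pointer, which is consistent with the reported supervisor write to address 0. This is a manual reading, not disassembler output; the kernel tree's `scripts/decodecode` can be used to check it.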