PROBLEM: Possible bug in AMDGPU DC code?
From: Jordan Henderson
Date: Tue Dec 19 2017 - 00:47:32 EST
Hello,
I have not posted to LKML before, so I apologize if this is the wrong place for this message.
I purchased the recently released HP Envy x360 laptop, which has a Ryzen 2500U APU with a Vega 10 GPU. After setting up Slackware on the laptop, I compiled kernel 4.15-rc2 with the AMDGPU DC code enabled to test its current functionality. The result is that, most of the time, the boot process hangs at
"Switching to amdgpudrmfb from EFI VGA"
Very rarely, the boot succeeds and everything seems to go smoothly. Adding "nomodeset" to the kernel parameters makes the boot always succeed, but it also prevents amdgpu from working correctly, since the driver requires modesetting.
I have also tried the same process in Ubuntu 17.10, and with kernels 4.15-rc3 and 4.15-rc4, with the same results. The only way I was able to capture system output that seemed relevant was to blacklist amdgpu and then load it with modprobe once I was in my desktop environment. This promptly froze the system, but it seemed to reveal some information about an MCE hardware error. Unfortunately, it seems mcelog doesn't support Ryzen yet, so I can't retrieve any useful information that way. However, /var/log/syslog did cough up a little more, specifically:
Dec 19 04:23:44 darkstar kernel: [ 1139.605187] amdgpu 0000:03:00.0: [mmhub] VMC page fault (src_id:0 ring:153 vm_id:0 pas_id:0)
Dec 19 04:23:44 darkstar kernel: [ 1139.605191] amdgpu 0000:03:00.0: at page 0x0000000000000000 from 18
Dec 19 04:23:44 darkstar kernel: [ 1139.605193] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Dec 19 04:23:44 darkstar kernel: [ 1139.605206] [Hardware Error]: Deferred error, no action required.
Dec 19 04:23:44 darkstar kernel: [ 1139.605212] [Hardware Error]: CPU:0 (17:11:0) MC20_STATUS[-|-|MiscV|-|AddrV|Deferred|-|SyndV|-|UECC]: 0x9c2030000001085b
Dec 19 04:23:44 darkstar kernel: [ 1139.605218] [Hardware Error]: Error Addr: 0x00007ffcffffff00
Dec 19 04:23:44 darkstar kernel: [ 1139.605220] [Hardware Error]: IPID: 0x0000002e00000000, Syndrome: 0x000000005b240205
Dec 19 04:23:44 darkstar kernel: [ 1139.605224] [Hardware Error]: Coherent Slave Extended Error Code: 1
Dec 19 04:23:44 darkstar kernel: [ 1139.605225] [Hardware Error]: Coherent Slave Error: Address violation.
Dec 19 04:23:44 darkstar kernel: [ 1139.605228] [Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
which at least appear to be related.
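Since mcelog cannot decode this record, the flag names printed in the MC20_STATUS line can at least be cross-checked against the raw value by hand. The following is a minimal Python sketch, not an existing tool: the bit positions are an assumption based on the MCI_STATUS_* masks in the kernel's arch/x86/include/asm/mce.h and on how the AMD MCE decoder prints the ECC bits, and the names STATUS_BITS and decode_mca_status are illustrative only.

```python
# Assumed bit positions for a 64-bit MCA status value. These follow the
# MCI_STATUS_* masks in the kernel headers as I understand them; verify
# them against your own kernel tree before relying on the result.
STATUS_BITS = {
    "Val": 63,       # register contains a valid error
    "Over": 62,      # an earlier error was overwritten
    "UC": 61,        # uncorrected error
    "En": 60,        # error reporting was enabled
    "MiscV": 59,     # MISC register is valid
    "AddrV": 58,     # ADDR register is valid
    "PCC": 57,       # processor context corrupt
    "SyndV": 53,     # syndrome register is valid
    "CECC": 46,      # correctable ECC error
    "UECC": 45,      # uncorrectable ECC error
    "Deferred": 44,  # deferred error
    "Poison": 43,    # poisoned data consumed
}


def decode_mca_status(status):
    """Return the names of the flag bits set in a 64-bit MCA status value."""
    return [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]


if __name__ == "__main__":
    # Value copied from the MC20_STATUS line in the log above.
    for flag in decode_mca_status(0x9C2030000001085B):
        print(flag)
```

Under those assumptions, the flags this reports for 0x9c2030000001085b include MiscV, AddrV, SyndV, UECC and Deferred, which agrees with the bracketed list the kernel printed, and UC and PCC are clear, which is consistent with the "Deferred error, no action required" line.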
As I have not heard of many other issues with the AMDGPU DC code, I believe this problem is specific to this particular laptop/BIOS/hardware configuration. Using the modprobe method, I have attached everything I was able to capture up to the system hang that I believe is relevant or that the bug reporting FAQ suggests; please let me know if more information would be useful.
Attachments:
cpuinfo
dmesg
iomem
ioports
lspci
messages
modules
scsi
syslog
ver_linux