Re: Regression from dcadfd7f7c74ef9ee415e072a19bdf6c085159eb

From: Mario Limonciello
Date: Tue Dec 05 2023 - 15:04:35 EST


On 12/3/2023 06:29, Takashi Sakamoto wrote:
Hi Mario,

Thanks for the advice.

Note that in my experiments I use Ubuntu 23.04 amd64 (v6.2 kernel) with a
backported FireWire stack[1]. Except for the stack, the kernel and software
packages can be retrieved from the Ubuntu project's repositories.

On Tue, Nov 28, 2023 at 12:09:41AM -0600, Mario Limonciello wrote:
On 11/27/2023 23:24, Takashi Sakamoto wrote:
Hi Mario

Following up on our last conversation, I purchased some hardware to
try to capture output from a serial port. In the end, I bought another
motherboard on the used market that provides a serial port from its Super I/O
chip (ASUS TUF Gaming X570-Plus). However, I have captured no helpful
output yet when the system reboots.

Did you up the loglevel to 8 to make sure you'll get all kernel output on
the serial port, not just errors?

Even when I pass the 'debug' cmdline option or raise the console
loglevel via sysctl, I receive no useful output on the console when loading
the module at or after boot.

```
$ sysctl kernel.printk
kernel.printk = 7 7 1 7
```
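
For reference, a minimal sketch of how the four `kernel.printk` fields can be read. The field names follow the kernel's sysctl documentation; the helper names `parse_printk` and `debug_reaches_console` are made up for illustration. The kernel prints a message on the console only when its level is numerically lower than the console loglevel, so with the value 7 shown above, KERN_DEBUG (level 7) messages would not reach the serial console.

```python
from typing import NamedTuple

KERN_DEBUG = 7  # numeric level of debug messages


class PrintkLevels(NamedTuple):
    console_loglevel: int
    default_message_loglevel: int
    minimum_console_loglevel: int
    default_console_loglevel: int


def parse_printk(text: str) -> PrintkLevels:
    """Parse the four integers found in /proc/sys/kernel/printk."""
    fields = text.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    return PrintkLevels(*(int(field) for field in fields))


def debug_reaches_console(levels: PrintkLevels) -> bool:
    """True if KERN_DEBUG messages would be printed on the console."""
    # Only messages with a level strictly below console_loglevel are shown.
    return levels.console_loglevel > KERN_DEBUG


# Hypothetical usage on a live system:
#   with open("/proc/sys/kernel/printk") as f:
#       print(debug_reaches_console(parse_printk(f.read())))
```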

I tried several different cases: enabling/disabling IOMMU and
enabling/disabling SVM at the motherboard level. None of them had any effect.

As you suggested, I checked whether PCIe AER is enabled in the
running kernel (Ubuntu 23.04 linux-image-6.2.0-37-generic). It is
certainly enabled; however, I see nothing in the output, as I noted.

I ran into additional trouble involving AMD Ryzen machines and the
PCIe device in question:

* ASRock X570 Phantom Gaming 4 with AMD Ryzen 5 3600X does not detect
the card. There is no corresponding entry in lspci.
* After binding the card to vfio-pci, running lspci can reboot the
system even if the firewire-ohci driver is not loaded. I can reproduce it
on both Gigabyte AX370-Gaming 5 and ASUS TUF Gaming X570-Plus with AMD
Ryzen 2400G.

Rather than lspci, is it specifically config space access from sysfs? Does
the output from the serial port change with IOMMU enabled vs disabled?

In the lspci case, I used a debugger and found that pread(2) on the
file descriptor for the 'config' node in sysfs causes the unexpected system
reboot. Additionally, I can reproduce it by running hexdump(1) on the node:

OK - is this by chance related to access to PCI extended config space failing for this device then? If you read just the first 256 bytes it's ok, but beyond that it fails?

If so, can you please try to reproduce using this series from Bjorn applied:
https://lore.kernel.org/r/20231121183643.249006-1-helgaas@xxxxxxxxxx

And then add this to kernel command line:
efi=debug "dyndbg=file arch/x86/pci/* +p"

Capture the dmesg and share it.
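
To narrow down the 256-byte question, the two regions can be read separately. The following is only a sketch: the function name `read_config_regions` is hypothetical, and it assumes the sysfs `config` node exposes up to 4096 bytes for devices with extended config space. It needs root, since unprivileged reads of this node are truncated to the first 64 bytes (consistent with the hexdump output in this thread stopping at offset 0x40), and on the affected machine it may trigger the reboot.

```python
import os
from typing import Tuple

PCI_CFG_SPACE_SIZE = 256        # conventional PCI config space
PCI_CFG_SPACE_EXP_SIZE = 4096   # PCIe extended config space


def read_config_regions(config_path: str) -> Tuple[bytes, bytes]:
    """Read the conventional and extended regions with separate pread calls."""
    fd = os.open(config_path, os.O_RDONLY)
    try:
        # First 256 bytes: offsets 0x000-0x0ff.
        base = os.pread(fd, PCI_CFG_SPACE_SIZE, 0)
        # Remainder: offsets 0x100-0xfff. An empty result means the node
        # does not expose extended config space for this device.
        extended = os.pread(
            fd, PCI_CFG_SPACE_EXP_SIZE - PCI_CFG_SPACE_SIZE, PCI_CFG_SPACE_SIZE
        )
    finally:
        os.close(fd)
    return base, extended


# Hypothetical usage, with the device address taken from the lspci output:
#   base, ext = read_config_regions("/sys/bus/pci/devices/0000:05:00.0/config")
#   print(len(base), len(ext))
```

Printing a marker to the serial console between the two reads would show which of them precedes the reboot.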


```
$ lspci
...
04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 03)
05:00.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller [1106:3044] (rev 80)
...
$ hexdump -C /sys/bus/pci/devices/0000\:05\:00.0/config
00000000 06 11 44 30 80 00 10 02 80 10 00 0c 10 20 00 00 |..D0......... ..|
00000010 00 00 90 fc 01 d0 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 06 11 44 30 |..............D0|
00000030 00 00 00 00 50 00 00 00 00 00 00 00 ff 01 00 20 |....P.......... |
00000040

$ lsmod | grep firewire
(no output)

$ sudo -i
# modprobe vfio-pci
# echo 1106 3044 > /sys/bus/pci/drivers/vfio-pci/new_id
# exit

$ hexdump -C /sys/bus/pci/devices/0000\:05\:00.0/config
(reboot)
```

Can you access config space for other PCIe devices successfully on this system?
Specifically extended config space?
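
One way to pick candidate devices for that comparison without touching their config space is to look only at the size of each `config` node. This is a sketch under the assumption that sysfs reports 4096 bytes for devices with extended config space and 256 bytes otherwise; the function name `devices_with_extended_config` is invented for this example.

```python
import os
from typing import List


def devices_with_extended_config(
    sysfs_root: str = "/sys/bus/pci/devices",
) -> List[str]:
    """List devices whose config node is larger than 256 bytes."""
    names = []
    for name in sorted(os.listdir(sysfs_root)):
        config = os.path.join(sysfs_root, name, "config")
        try:
            # stat() only inspects metadata; no config space read is issued.
            size = os.stat(config).st_size
        except OSError:
            continue
        if size > 256:
            names.append(name)
    return names


# Hypothetical usage:
#   for dev in devices_with_extended_config():
#       print(dev)
```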


I can suppress it by disabling IOMMU on the motherboard. In this respect, the
lspci issue is a bit different from the driver issue.

I'd be pleased to hear any further ideas you have for getting helpful output
from the system. But I guess I should start looking for a workaround that
avoids the problematic register access instead of investigating the reboot
mechanism, sigh...

Anyway, thanks for your help.

Can you check FCH::PM::S5_RESET_STATUS on next boot after failure has
occurred? It is available at MMIO FED80300 or through indirect IO access at
0xC0.

If MMIO doesn't work, double check FCH::PM_ISACONTROL bit 1 (described on
page 296) to confirm if your system enables it.

The meanings of the different bits can be found in a recent PPR:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/55901_B1_pub_053.zip

Indirect IO is described on PDF page 294.

This will at least give us a hint what's going on in this case.
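
A rough sketch of the MMIO route through /dev/mem follows. It rests on assumptions that should be checked against the PPR: that FED80300 is the base of the PM register block, that 0xC0 is the offset of S5_RESET_STATUS within it, and that the register is 32 bits wide. The names `read_mmio32` and `set_bits` are hypothetical. It needs root, and /dev/mem access may be blocked by kernel configuration or lockdown.

```python
import mmap
import os
import struct
from typing import List

FCH_PM_BASE = 0xFED80300        # assumption: PM register block base
S5_RESET_STATUS_OFFSET = 0xC0   # assumption: offset within that block


def read_mmio32(phys_addr: int, path: str = "/dev/mem") -> int:
    """Read a little-endian 32-bit value at a physical address."""
    page = mmap.PAGESIZE
    page_base = phys_addr & ~(page - 1)   # mmap offsets must be page aligned
    offset = phys_addr - page_base
    if offset + 4 > page:
        raise ValueError("read would cross a page boundary")
    fd = os.open(path, os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, page, mmap.MAP_SHARED, mmap.PROT_READ,
                       offset=page_base) as mm:
            # Note: Python does not guarantee a single 32-bit bus access here.
            return struct.unpack("<I", mm[offset:offset + 4])[0]
    finally:
        os.close(fd)


def set_bits(value: int) -> List[int]:
    """Return the positions of the bits set in a 32-bit value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


# Hypothetical usage on the next boot after a failure:
#   status = read_mmio32(FCH_PM_BASE + S5_RESET_STATUS_OFFSET)
#   print(hex(status), set_bits(status))
```

The bit positions it prints still have to be looked up in the PPR; the sketch deliberately assigns no meaning to them.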

I'll try the above this week. Thanks.


[1] https://github.com/takaswie/linux-firewire-dkms/

Takashi Sakamoto