Re: [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put devices into D0 when initializing"

From: Mario Limonciello

Date: Thu Dec 04 2025 - 00:29:47 EST

On 12/3/2025 11:04 PM, Matthew Ruffell wrote:

Hi Mario,

I thank you for your prompt reply, and apologise for my delayed reply.
Answers inline.

When you say AWS specific patches, can you be more specific? What is
missing from a mainline kernel to use this hardware? I.e., how do I know
there aren't Ubuntu specific patches *causing* this issue?

I can reproduce the issue with the current HEAD of Linus's tree, with no
additional patches applied. My current HEAD for testing is the 6.19 merge
window, commit 51ab33fc0a8bef9454849371ef897a1241911b37.
To get the mainline build to work on c5.metal on AWS I needed to edit a few
config parameters, and I have attached the config I used.

Now I've never used AWS - do you have an opportunity to do "regular"
reboots, or only kexec reboots?

This issue only happens with a kexec reboot, right?

We can do regular and kexec reboots with the c5.metal instance type. The issue
only happens with a kexec reboot.

The first thing that jumps out at me is the code in
pci_device_shutdown() that clears bus mastering for a kexec reboot.
If you comment that out what happens?

I commented out the code that clears bus mastering, diff below, and kexec boots
correctly now, and the NVME drive appears just as it did before
"4d4c10f PCI: Explicitly put devices into D0 when initializing".

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 302d61783f6c..0cb14ff32475 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -517,8 +517,9 @@ static void pci_device_shutdown(struct device *dev)
 	 * If it is not a kexec reboot, firmware will hit the PCI
 	 * devices with big hammer and stop their DMA any way.
 	 */
-	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
-		pci_clear_master(pci_dev);
+/* if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
+ * pci_clear_master(pci_dev);
+ */
 }

 #ifdef CONFIG_PM_SLEEP

Since this works, does that mean that the bus master bit isn't being set on the
NVME device on the other side of kexec?

That's at least what it seems like. And I guess trying to set D0 without bus mastering enabled is causing a problem.
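
One way to check this from userspace on the kexec'd kernel is to read the device's config space through sysfs and test the Bus Master Enable bit (bit 2 of the Command register at offset 0x04). The following is only a minimal sketch and not something from the kernel tree: the helper names cfg_load(), cfg_read16() and cfg_bus_master_enabled() are made up for illustration, and the sysfs path mentioned afterwards uses a placeholder for the device address.

```c
#include <stdint.h>
#include <stdio.h>

/* Offsets and bits from the PCI spec (mirrored in include/uapi/linux/pci_regs.h). */
#define CFG_COMMAND        0x04   /* 16-bit Command register */
#define CFG_COMMAND_MASTER 0x0004 /* Bus Master Enable bit */

/* Read a little-endian 16-bit value from a config space buffer. */
uint16_t cfg_read16(const uint8_t *cfg, size_t off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Return 1 if Bus Master Enable is set in the Command register, else 0. */
int cfg_bus_master_enabled(const uint8_t *cfg)
{
    return (cfg_read16(cfg, CFG_COMMAND) & CFG_COMMAND_MASTER) != 0;
}

/*
 * Hypothetical helper: load the first 64 bytes of config space from a
 * sysfs "config" file. Returns 0 on success, -1 on failure.
 */
int cfg_load(const char *path, uint8_t cfg[64])
{
    FILE *f = fopen(path, "rb");
    size_t n;

    if (!f)
        return -1;
    n = fread(cfg, 1, 64, f);
    fclose(f);
    return n == 64 ? 0 : -1;
}
```

Usage would be cfg_load("/sys/bus/pci/devices/<BDF>/config", cfg) followed by cfg_bus_master_enabled(cfg), where <BDF> stands for the NVMe device's address. The same bit shows up as BusMaster+ or BusMaster- in the Control line of `lspci -vv`.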

Could you try adding a pci_set_master() call to pci_power_up()? This is what I have in mind (only compile tested):

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b14dd064006c..68661e333032 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1323,6 +1323,7 @@ int pci_power_up(struct pci_dev *dev)
 		return -EIO;
 	}

+	pci_set_master(dev);
 	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
 	if (PCI_POSSIBLE_ERROR(pmcsr)) {
 		pci_err(dev, "Unable to change power state from %s to D0, device inaccessible\n",

The next thing I would wonder is if you're compiling with
CONFIG_KEXEC_JUMP and if that has an impact on your issue. When this is
defined there is a device suspend sequence in kernel_kexec() that is run
which will run various suspend related callbacks. Maybe the issue is
actually in one of those callbacks.

Yes, Ubuntu kernels set CONFIG_KEXEC_JUMP=y. I did a build with
CONFIG_KEXEC_JUMP=n and it has the same symptoms.

A possible way to determine this would be to run rtcwake to suspend and
resume and see if the drive survives. If it doesn't, it's a hint that
there is something going on with power management in this drive or the
bridge it's connected to. Maybe one of them isn't handling D3 very well.

Unfortunately, this c5.metal instance type doesn't support rtcwake with mode mem
or disk, as hibernation is disabled on these instance types. But since
CONFIG_KEXEC_JUMP=n doesn't help,

I'm going to add some debug statements to pci_device_shutdown() to see what
state the NVME device is in with and without
"4d4c10f PCI: Explicitly put devices into D0 when initializing".
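
As a userspace complement to in-kernel debug statements, the power state the hardware reports can also be decoded from a config space dump: walk the capability list to the Power Management capability and read the low two bits of PMCSR. This is a rough sketch under stated assumptions, not the planned kernel debugging itself. The function names are invented for illustration, and it assumes a 256-byte buffer read from the device's sysfs "config" file as root (unprivileged reads only return the first 64 bytes, which stops short of the capabilities).

```c
#include <stdint.h>

/* Offsets and bits from the PCI spec (mirrored in include/uapi/linux/pci_regs.h). */
#define CFG_STATUS          0x06   /* 16-bit Status register */
#define CFG_STATUS_CAP_LIST 0x0010 /* capability list present */
#define CFG_CAP_PTR         0x34   /* offset of the first capability */
#define CAP_ID_PM           0x01   /* Power Management capability ID */
#define PM_CTRL             4      /* PMCSR offset inside the PM capability */
#define PM_CTRL_STATE_MASK  0x0003 /* 0 = D0, 1 = D1, 2 = D2, 3 = D3hot */

/* Read a little-endian 16-bit value from a config space buffer. */
uint16_t pm_read16(const uint8_t *cfg, unsigned int off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Return the PM capability offset in a 256-byte dump, or 0 if not found. */
int cfg_find_pm_cap(const uint8_t *cfg)
{
    unsigned int pos;
    int ttl = 48; /* bound the walk in case the list is corrupt or loops */

    if (!(pm_read16(cfg, CFG_STATUS) & CFG_STATUS_CAP_LIST))
        return 0;

    pos = cfg[CFG_CAP_PTR] & 0xfc;
    while (pos >= 0x40 && ttl-- > 0) {
        if (cfg[pos] == CAP_ID_PM)
            return (int)pos;
        pos = cfg[pos + 1] & 0xfc; /* next-capability pointer */
    }
    return 0;
}

/*
 * Return 0..3 for D0..D3hot as reported by PMCSR, or -1 if no PM
 * capability is found (which is also the result for an all-0xff dump
 * from an inaccessible device).
 */
int cfg_pm_state(const uint8_t *cfg)
{
    int pm = cfg_find_pm_cap(cfg);

    if (pm == 0 || pm + PM_CTRL + 1 > 255)
        return -1;
    return pm_read16(cfg, (unsigned int)pm + PM_CTRL) & PM_CTRL_STATE_MASK;
}
```

Comparing the value returned by cfg_pm_state() before and after the kexec, with and without the commit applied, would show whether the device really comes back in a state other than D0.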

Thanks,
Matthew

Thanks for the updates.

I have a relatively ignorant question. Can you reproduce with kdump and a crash too?

What I don't actually know is this: if you configure kdump and then crash the kernel (say, with the magic SysRq key), does pci_device_shutdown() get called in order to do the kexec? Or, because the kernel is already in a crash state, is there just a jump into the crash kernel image location?