Re: [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put devices into D0 when initializing"
From: Matthew Ruffell
Date: Fri Feb 13 2026 - 00:54:39 EST
Hi Mario,
I have been doing some experiments with c5.metal on AWS with an Ubuntu resolute
userspace and a 7.0 merge window kernel.
Amazon AMI:
ami-04080dce0ccaa893d
Linux HEAD 582a1ef360a05bff4350bbf6e383f61d26b804f0 (Linus's tree).
$ sudo lspci
https://paste.ubuntu.com/p/mtgbn2HJtW/
$ sudo lspci -vvv -t
https://paste.ubuntu.com/p/s6NhBz5FZy/
I noticed that nvme_pci_enable() calls pci_set_master() with the pci device, and
I wondered, is this called with the NVMe device, so I went and had a look.
I added your patch:
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 595ec12d85df..bd116dccd897 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1300,6 +1300,7 @@ int pci_power_up(struct pci_dev *dev)
bool need_restore;
pci_power_t state;
u16 pmcsr;
+ u16 old_cmd;
platform_pci_set_power_state(dev, PCI_D0);
@@ -1347,6 +1348,10 @@ int pci_power_up(struct pci_dev *dev)
udelay(PCI_PM_D2_DELAY);
end:
+ pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
+ pci_info(dev, "Bus mastering bit is %sabled in D0\n",
+ (old_cmd & PCI_COMMAND_MASTER) ? "en" : "dis");
+
dev->current_state = PCI_D0;
if (need_restore)
return 1;
as well as some additional logging in nvme_pci_enable:
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 80df992d1ae8..2d1898016d40 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2992,6 +2992,7 @@ static bool nvme_pci_update_nr_queues(struct
nvme_dev *dev)
static int nvme_pci_enable(struct nvme_dev *dev)
{
+ dev_info(dev->ctrl.device, "mruffell: nvme_pci_enable() start/n");
int result = -ENOMEM;
struct pci_dev *pdev = to_pci_dev(dev->dev);
unsigned int flags = PCI_IRQ_ALL_TYPES;
@@ -2999,6 +3000,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
if (pci_enable_device_mem(pdev))
return result;
+ pci_info(pdev, "mruffell: nvme_pci_enable() set master\n");
pci_set_master(pdev);
if (readl(dev->bar + NVME_REG_CSTS) == -1) {
@@ -3064,11 +3066,13 @@ static int nvme_pci_enable(struct nvme_dev *dev)
result = nvme_pci_configure_admin_queue(dev);
if (result)
goto free_irq;
+ pci_info(pdev, "mruffell: nvme_pci_enable() end\n");
return result;
free_irq:
pci_free_irq_vectors(pdev);
disable:
+ pci_info(pdev, "mruffell: nvme_pci_enable() disable\n");
pci_disable_device(pdev);
return result;
}
So, booting the system on a clean reboot, we see nvme_pci_enable does infact
get called with the nvme device, and all PCI devices in D0 with the bus master
bit enabled.
https://paste.ubuntu.com/p/zp3khwfMg2/
Then I did a kexec.
https://paste.ubuntu.com/p/SWB5jz6x4g/
nvme_pci_enable() still gets called, but it doesn't seem to change anything.
The system still fails to issue I/O to the NVMe and all pci devices say the
bus mastering bit is disabled in D0.
I then reverted
commit 51c0996dadaea20d73eb0495aeda9cb0422243e8
Author: Brian Norris <briannorris@xxxxxxxxxxxx>
Date: Thu Jan 22 09:48:15 2026 -0800
Subject: PCI/PM: Prevent runtime suspend until devices are fully initialized
commit 907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0
Author: Mario Limonciello <mario.limonciello@xxxxxxx>
Date: Wed Jun 11 18:31:16 2025 -0500
Subject: PCI/PM: Set up runtime PM even for devices without PCI PM
commit 4d4c10f763d7808fbade28d83d237411603bca05
Author: Mario Limonciello <mario.limonciello@xxxxxxx>
Date: Wed Apr 23 23:31:32 2025 -0500
Subject: PCI: Explicitly put devices into D0 when initializing
and repeated the tests.
This time, on a clean reboot I get:
https://paste.ubuntu.com/p/qm6ZBWNqG5/
I get no messages about bus mastering at all. I did a kexec reboot and the
system comes up correctly, again with no messages about bus mastering. So we
must return early before the printout.
I then added some more logging, to see if we can see what state these devices
are in:
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bd116dccd897..deed439af6e4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1302,10 +1302,16 @@ int pci_power_up(struct pci_dev *dev)
u16 pmcsr;
u16 old_cmd;
+ state = platform_pci_get_power_state(dev);
+ pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
+ pci_info(dev, "Initial: Bus mastering bit is %sabled in %d\n",
+ (old_cmd & PCI_COMMAND_MASTER) ? "en" : "dis", state);
+
platform_pci_set_power_state(dev, PCI_D0);
if (!dev->pm_cap) {
state = platform_pci_get_power_state(dev);
+ pci_info(dev, "mruffell: pci power state is %d\n", state);
if (state == PCI_UNKNOWN)
dev->current_state = PCI_D0;
else
@@ -1349,7 +1355,7 @@ int pci_power_up(struct pci_dev *dev)
end:
pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
- pci_info(dev, "Bus mastering bit is %sabled in D0\n",
+ pci_info(dev, "mruffell: final: Bus mastering bit is %sabled in D0\n",
(old_cmd & PCI_COMMAND_MASTER) ? "en" : "dis");
dev->current_state = PCI_D0;
The result for a clean reboot is:
https://paste.ubuntu.com/p/qh84c8T9ND/
We seem to skip pci_power_up() entirely for most devices, including the NVMe
drive. Those that enter pci_power_up() are in state 5, or PCI_UNKNOWN.
I did a kexec, and we get:
https://paste.ubuntu.com/p/D2GhjKxDG3/
Pretty much identical results, and the system comes up correctly. Each time
the bus mastering printout at the very end of pci_power_up() is never called.
The only conclusions I can seem to draw from this is when
"PCI: Explicitly put devices into D0 when initializing" is present, all devices
get their bus mastering bit, and are placed into D0, and on kexec, all devices
lose their bus mastering bit and never get it back again.
If we revert "PCI: Explicitly put devices into D0 when initializing", devices
that work bypass pci_power_up(), get their bus mastering bits set correctly,
and kexec works.
I'll keep trying to debug this. But I am welcome to suggestions on data to
gather.
Thanks,
Matthew