Re: [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put devices into D0 when initializing"
From: Mario Limonciello
Date: Tue Feb 17 2026 - 09:37:34 EST
On 2/13/26 1:26 PM, Bjorn Helgaas wrote:
[+cc Brian, author of 51c0996dadae ("PCI/PM: Prevent runtime suspend
until devices are fully initialized") (though it's not clear to me
whether this makes any difference)]
On Fri, Feb 13, 2026 at 06:54:13PM +1300, Matthew Ruffell wrote:
Hi Mario,
I have been doing some experiments with c5.metal on AWS with an
Ubuntu resolute userspace and a 7.0 merge window kernel.
Amazon AMI:
ami-04080dce0ccaa893d
Linux HEAD 582a1ef360a05bff4350bbf6e383f61d26b804f0 (Linus's tree).
$ sudo lspci
https://paste.ubuntu.com/p/mtgbn2HJtW/
$ sudo lspci -vvv -t
https://paste.ubuntu.com/p/s6NhBz5FZy/
I noticed that nvme_pci_enable() calls pci_set_master() on the PCI
device, and I wondered whether it is actually being called for the
NVMe device, so I went and had a look.
I added your patch:
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 595ec12d85df..bd116dccd897 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1300,6 +1300,7 @@ int pci_power_up(struct pci_dev *dev)
bool need_restore;
pci_power_t state;
u16 pmcsr;
+ u16 old_cmd;
platform_pci_set_power_state(dev, PCI_D0);
@@ -1347,6 +1348,10 @@ int pci_power_up(struct pci_dev *dev)
udelay(PCI_PM_D2_DELAY);
end:
+ pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
+ pci_info(dev, "Bus mastering bit is %sabled in D0\n",
+ (old_cmd & PCI_COMMAND_MASTER) ? "en" : "dis");
+
dev->current_state = PCI_D0;
if (need_restore)
return 1;
as well as some additional logging in nvme_pci_enable:
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 80df992d1ae8..2d1898016d40 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2992,6 +2992,7 @@ static bool nvme_pci_update_nr_queues(struct nvme_dev *dev)
static int nvme_pci_enable(struct nvme_dev *dev)
{
+ dev_info(dev->ctrl.device, "mruffell: nvme_pci_enable() start\n");
int result = -ENOMEM;
struct pci_dev *pdev = to_pci_dev(dev->dev);
unsigned int flags = PCI_IRQ_ALL_TYPES;
@@ -2999,6 +3000,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
if (pci_enable_device_mem(pdev))
return result;
+ pci_info(pdev, "mruffell: nvme_pci_enable() set master\n");
pci_set_master(pdev);
if (readl(dev->bar + NVME_REG_CSTS) == -1) {
@@ -3064,11 +3066,13 @@ static int nvme_pci_enable(struct nvme_dev *dev)
result = nvme_pci_configure_admin_queue(dev);
if (result)
goto free_irq;
+ pci_info(pdev, "mruffell: nvme_pci_enable() end\n");
return result;
free_irq:
pci_free_irq_vectors(pdev);
disable:
+ pci_info(pdev, "mruffell: nvme_pci_enable() disable\n");
pci_disable_device(pdev);
return result;
}
So, booting the system with a clean reboot, we see that
nvme_pci_enable() does in fact get called for the NVMe device, and
all PCI devices are in D0 with the bus master bit enabled.
https://paste.ubuntu.com/p/zp3khwfMg2/
Then I did a kexec.
https://paste.ubuntu.com/p/SWB5jz6x4g/
nvme_pci_enable() still gets called, but it doesn't seem to change
anything. The system still fails to issue I/O to the NVMe device, and
all PCI devices report that the bus mastering bit is disabled in D0.
I then reverted
commit 51c0996dadaea20d73eb0495aeda9cb0422243e8
Author: Brian Norris <briannorris@xxxxxxxxxxxx>
Date: Thu Jan 22 09:48:15 2026 -0800
Subject: PCI/PM: Prevent runtime suspend until devices are fully initialized
commit 907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0
Author: Mario Limonciello <mario.limonciello@xxxxxxx>
Date: Wed Jun 11 18:31:16 2025 -0500
Subject: PCI/PM: Set up runtime PM even for devices without PCI PM
commit 4d4c10f763d7808fbade28d83d237411603bca05
Author: Mario Limonciello <mario.limonciello@xxxxxxx>
Date: Wed Apr 23 23:31:32 2025 -0500
Subject: PCI: Explicitly put devices into D0 when initializing
and repeated the tests.
This time, on a clean reboot I get:
https://paste.ubuntu.com/p/qm6ZBWNqG5/
I get no messages about bus mastering at all. I did a kexec reboot
and the system comes up correctly, again with no messages about bus
mastering. So we must be returning early, before the printout.
I then added some more logging, to see if we can see what state
these devices are in:
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bd116dccd897..deed439af6e4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1302,10 +1302,16 @@ int pci_power_up(struct pci_dev *dev)
u16 pmcsr;
u16 old_cmd;
+ state = platform_pci_get_power_state(dev);
+ pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
+ pci_info(dev, "Initial: Bus mastering bit is %sabled in %d\n",
+ (old_cmd & PCI_COMMAND_MASTER) ? "en" : "dis", state);
If you include "%s", paired with __func__, in the format string,
we'll get the locations directly.
platform_pci_set_power_state(dev, PCI_D0);
if (!dev->pm_cap) {
state = platform_pci_get_power_state(dev);
+ pci_info(dev, "mruffell: pci power state is %d\n", state);
if (state == PCI_UNKNOWN)
dev->current_state = PCI_D0;
else
@@ -1349,7 +1355,7 @@ int pci_power_up(struct pci_dev *dev)
end:
pci_read_config_word(dev, PCI_COMMAND, &old_cmd);
- pci_info(dev, "Bus mastering bit is %sabled in D0\n",
+ pci_info(dev, "mruffell: final: Bus mastering bit is %sabled in D0\n",
(old_cmd & PCI_COMMAND_MASTER) ? "en" : "dis");
dev->current_state = PCI_D0;
The result for a clean reboot is:
https://paste.ubuntu.com/p/qh84c8T9ND/
We seem to skip pci_power_up() entirely for most devices, including
the NVMe drive. Those that enter pci_power_up() are in state 5, or
PCI_UNKNOWN.
I did a kexec, and we get:
https://paste.ubuntu.com/p/D2GhjKxDG3/
Pretty much identical results, and the system comes up correctly.
In each case, the bus mastering printout at the very end of
pci_power_up() is never reached.
The only conclusion I can draw from this is that when "PCI:
Explicitly put devices into D0 when initializing" is present, all
devices get their bus mastering bit set and are placed into D0 on a
clean boot, but on kexec, all devices lose their bus mastering bit
and never get it back.
If we revert "PCI: Explicitly put devices into D0 when
initializing", devices that work bypass pci_power_up(), get their
bus mastering bits set correctly, and kexec works.
Thanks for all your instrumentation and debugging!
I don't think it's safe for the PCI core to indiscriminately enable
bus mastering during boot. That allows devices to DMA to/from system
memory even if there's no driver to operate the device. Bus mastering
should only be enabled by a driver.
It seems to me the problem is a mismatch between the device's actual
state and what the kernel thinks it did, in the context of kexec.
I wonder if instead we should do the inverse: explicitly turn off bus
mastering before we put the device into D0 at startup. As it stands
today, there appear to be some devices for which the BIOS leaves bus
mastering enabled, and kexec may have turned bus mastering on, so we
are just "getting lucky" that there hasn't been a problem from that.
This would also align the kernel's behavior with the design pattern
that bus mastering should always be off until a driver asks for it.
Per D2GhjKxDG3, nvme_pci_enable() is being called after the kexec, as
it should be. We need to figure out why it isn't enabling bus
mastering.
It looks like something is wrong before we even get to
nvme_pci_enable() because it took 80+ seconds to get there after
kexec (D2GhjKxDG3), vs 2 seconds for the clean reboot (qh84c8T9ND).
Can you please put the entire dmesg log in a pastebin, possibly also
with a note about the circumstances and a pointer to another pastebin
with the "git diff" from upstream to the kernel? It's hard for me to
pull all the pieces together from the excerpts and keep track of which
logs correspond to reboots vs kexecs, etc.
Bjorn