RE: [RESEND] Handle MPS mismatch for Switch Downstream Ports

From: Devilliv Kelly

Date: Fri Apr 03 2026 - 10:06:14 EST


On Thursday, April 2, 2026 5:38 AM, Bjorn Helgaas wrote:
> Thanks for the report, and sorry we missed your original email; I couldn't find it
> in the lore archives, so maybe it got lost in transit.

Thank you for your reply. The original email was lost, probably because
I mistakenly used a non-plain-text format, sorry.

> On Tue, Mar 31, 2026 at 04:10:56AM +0000, Devilliv Kelly wrote:
> > Background
> > ===========
> > Commit 9f0e89359775 ("PCI: Match Root Port's MPS to endpoint's MPSS as
> > necessary") added logic to reduce a Root Port's MPS when an endpoint's
> > MPSS is smaller than the Root Port's current MPS setting. This ensures
> > hot-added devices can work correctly.
> >
> > However, this logic only applies to ROOT_PORT type bridges:
> >
> > 	mpss = 128 << dev->pcie_mpss;
> > 	if (mpss < p_mps && pci_pcie_type(bridge) == PCI_EXP_TYPE_ROOT_PORT) {
> > 		pcie_set_mps(bridge, mpss);
> > 		...
> > 	}
> >
> > This leaves Switch Downstream Ports unhandled, which can cause issues
> > when the Switch reports an incorrect or unexpected MPS value after
> > secondary bus reset.
> >
> > Problem Description
> > ===================
> > We encountered a scenario where a PCIe Switch Downstream Port reports
> > an MPS value larger than what the endpoint can support:
> >
> > Topology:
> > 16:00.0 - Switch Upstream Port (MPS = 512 bytes, correct)
> > └── 17:00.0 - Switch Downstream Port (MPS = 2048 bytes after secondary bus reset)
> >     └── 18:00.0 - Endpoint device (DevCap MaxPayload = 512 bytes)
> >
> > After a secondary bus reset, the Switch Downstream Port's MPS
> > unexpectedly became 2048 bytes. When the kernel enumerates the
> > endpoint device (18:00.0), it attempts to set the endpoint's MPS to
> > 2048 to match the upstream bridge, but this fails because the endpoint
> > only supports a maximum of 512 bytes.
> >
> > Kernel log shows:
> > pci 0000:18:00.0: can't set Max Payload Size to 2048; if necessary,
> > use "pci=pcie_bus_safe" and report a bug
>
> Can you please collect the complete dmesg log when booted with the
> 'dyndbg="file drivers/pci/* +p"' kernel parameter? (The double quotes are a
> necessary part of the parameter)

This suggestion was very effective and provided many debugging clues.
Please see the dmesg logs below.

> How do you initiate the reset and which device is being reset? What caused
> the subsequent enumeration?
>
> My guess is you used setpci to set the Secondary Bus Reset bit in the
> 16:00.0 Bridge Control register? And maybe you used a sysfs "rescan"
> file to enumerate the endpoint?
>
> If you used a sysfs reset interface or a driver called pci_reset_function(), the
> kernel should have saved and restored config space so the 17:00.0 MPS shouldn't
> change unexpectedly. Also, the kernel would only let you set SBR in 16:00.0 if
> there was a single device on bus 17, and switches typically have multiple
> downstream ports.

The full topology is:
+-[0000:11]-+-00.0 Intel Corporation Device 09a2
|           \-01.0-[12-23]----00.0-[13-23]--+-00.0-[14]----00.0 Device abcd:000d
|                                           +-01.0-[15]----00.0 Device abcd:000d
|                                           +-02.0-[16-1e]----00.0-[17-1e]--+-00.0-[18]----00.0 Device abcd:000d
|                                           |                               +-01.0-[19]----00.0 Device abcd:000d
|                                           |                               +-02.0-[1a]----00.0 Device abcd:000d
|                                           |                               +-03.0-[1b]----00.0 Device abcd:000d
|                                           +-03.0-[1f]----00.0 Device abcd:000d
|                                           +-04.0-[20]----00.0 Device abcd:000d

The abcd:000d devices are mine. The issue described above only occurs
on the 17:{00/01/02/03}.0 Downstream Ports, which belong to the cascaded
switch; the 13:{00/01/03/04}.0 Downstream Ports all work fine.

If I use sysfs and setpci to trigger a Secondary Bus Reset on 17:00.0,
for example:

echo 1 > /sys/devices/pci0000\:11/0000\:11\:01.0/0000\:12\:00.0/0000\:13\:02.0/0000\:16\:00.0/0000\:17\:00.0/0000\:18\:00.0/remove
setpci -s 17:00.0 BRIDGE_CONTROL=0x40:0x40
setpci -s 17:00.0 BRIDGE_CONTROL=0x00:0x40
echo 1 > /sys/devices/pci0000\:11/0000\:11\:01.0/0000\:12\:00.0/0000\:13\:02.0/0000\:16\:00.0/0000\:17\:00.0/rescan

In this case, the kernel saves and restores the config space of 17:00.0
as you expected, so the MPS is restored from 2048 to 512 before the bus
is rescanned, and everything works fine afterwards.

[ 5200.542953] pci 0000:18:00.0: PME# disabled
[ 5200.543144] pcieport 0000:17:00.0: saving config space at offset 0x0 (reading 0xc0301000)
...(saving at offset 0x4-0x38)
[ 5200.543391] pcieport 0000:17:00.0: saving config space at offset 0x3c (reading 0x30000)
[ 5200.543711] pci 0000:18:00.0: Removing from iommu group 67
[ 5200.543880] pci 0000:18:00.0: device released
[ 5200.543924] pcieport 0000:17:00.0: PME# enabled
[ 5277.558200] pci_bus 0000:17: scanning bus
[ 5277.575971] pcieport 0000:17:00.0: restoring config space at offset 0x2c (was 0x1530, writing 0x1530)
[ 5277.577655] pcieport 0000:17:00.0: restoring config space at offset 0x28 (was 0x1510, writing 0x1510)
[ 5277.577682] pcieport 0000:17:00.0: restoring config space at offset 0x24 (was 0x910001, writing 0x910001)
[ 5277.578015] pcieport 0000:17:00.0: PME# disabled
[ 5277.578027] pcieport 0000:17:00.0: scanning [bus 18-18] behind bridge, pass 0
[ 5277.578083] pci_bus 0000:18: scanning bus
[ 5277.578838] pci 0000:18:00.0: [abcd:000d] type 00 class 0x030200
...
[ 5277.580012] pci 0000:18:00.0: Max Payload Size set to 512 (was 128, max 512)
[ 5277.584015] pci 0000:18:00.0: PME# supported from D0 D3hot
[ 5277.584066] pci 0000:18:00.0: PME# disabled

If I trigger the reset from my driver, the sequence is almost the same as above:

pci_stop_and_remove_bus_device_locked(dev);

pm_runtime_get_sync(&bridge->dev);
pci_bridge_secondary_bus_reset(bridge);

pci_lock_rescan_remove();
pci_rescan_bus(bridge->bus);
pci_unlock_rescan_remove();

pm_runtime_put(&bridge->dev);

This sequence triggers the problem, because pm_runtime_get_sync() restores
the config space before the Secondary Bus Reset, so the unexpected MPS of
2048 after the reset remains unchanged until the end of the rescan.

[ 9768.822616] pci 0000:18:00.0: PME# disabled
[ 9768.823030] pcieport 0000:17:00.0: saving config space at offset 0x0 (reading 0xc0301000)
...(saving at offset 0x4-0x38)
[ 9768.823252] pcieport 0000:17:00.0: saving config space at offset 0x3c (reading 0x30000)
[ 9768.823731] pci 0000:18:00.0: Removing from iommu group 67
[ 9768.823773] pcieport 0000:17:00.0: PME# enabled
[ 9768.823987] pci 0000:18:00.0: device released
[ 9769.855833] pcieport 0000:17:00.0: restoring config space at offset 0x2c (was 0x1530, writing 0x1530)
[ 9769.855866] pcieport 0000:17:00.0: restoring config space at offset 0x28 (was 0x1510, writing 0x1510)
[ 9769.855894] pcieport 0000:17:00.0: restoring config space at offset 0x24 (was 0x910001, writing 0x910001)
[ 9769.858141] pcieport 0000:17:00.0: PME# disabled
[ 9770.891245] pci_bus 0000:17: scanning bus
[ 9770.891435] pcieport 0000:17:00.0: scanning [bus 18-18] behind bridge, pass 0
[ 9770.891492] pci_bus 0000:18: scanning bus
[ 9770.893557] pci 0000:18:00.0: [abcd:000d] type 00 class 0x030200
...
[ 9770.894670] pci 0000:18:00.0: can't set Max Payload Size to 2048; if necessary, use "pci=pcie_bus_safe" and report a bug
[ 9770.899584] pci 0000:18:00.0: PME# supported from D0 D3hot
[ 9770.899637] pci 0000:18:00.0: PME# disabled

As mentioned above, I am not sure whether I am using pm_runtime_get_sync()
correctly. I modeled it on drivers/pci/probe.c:

static int pci_scan_bridge_extend()
{
...
/*
* Make sure the bridge is powered on to be able to access config
* space of devices below it.
*/
pm_runtime_get_sync(&dev->dev);
pci_read_config_dword(dev, PCI_PRIMARY_BUS, &buses);
...
}

Also, is the Secondary Bus Reset sequence in my driver correct?

> > This results in NMI errors when the endpoint attempts DMA transactions:
> > Uhhuh. NMI received for unknown reason 2c on CPU 0.
> > Dazed and confused, but trying to continue
> >
> > Root Cause
> > ==========
> > The pci_configure_mps() function only adjusts the upstream bridge's
> > MPS when the bridge is a ROOT_PORT. For DOWNSTREAM_PORT types (Switch
> > ports), the kernel attempts to set the endpoint's MPS to the bridge's
> > value without checking if the endpoint can support it.
> >
> > While the Switch firmware should ideally configure correct MPS values,
> > the kernel should be robust enough to handle such cases and ensure
> > proper MPS configuration for reliable operation.
>
> Agreed. I think the SBR *should* reset MPS to the default value of 000b (128
> bytes), but maybe this switch doesn't work that way.
> Regardless, I agree that Linux should handle this better.

Even if the SBR *should* reset MPS to the default value of 000b (128 bytes),
the same problem still seems to occur in my case, because the MPS is not
restored to 512 afterwards.

So I wonder whether I am missing something.

> > Current Behavior
> > ================
> > 1. Endpoint's MPSS < Bridge's MPS
> > 2. Bridge is DOWNSTREAM_PORT (not ROOT_PORT)
> > 3. Kernel skips bridge MPS adjustment
> > 4. pcie_set_mps(dev, p_mps) fails because p_mps > dev's capability
> > 5. Device may not function correctly
> >
> > Workaround
> > ==========
> > The issue can be worked around by using the kernel parameter:
> > pci=pcie_bus_safe
> >
> > However, this affects the entire system and may reduce performance for
> > other devices.
> >
> > Questions for Discussion
> > ========================
> > 1. Was there a specific reason for restricting this logic to ROOT_PORT
> > only? The commit message mentions avoiding impact on "other unrelated
> > sub-topologies," but Switch Downstream Ports typically only have one
> > endpoint below them.
> >
> > 2. Should we also consider propagating MPS changes up through multiple
> > Switch levels in the hierarchy?
> >
> > References
> > ==========
> > - Commit 9f0e89359775: PCI: Match Root Port's MPS to endpoint's MPSS
> > as necessary
> > - Commit 27d868b5e6cf: PCI: Set MPS to match upstream bridge
> > - https://bugzilla.kernel.org/show_bug.cgi?id=200527 (original
> > ROOT_PORT fix)
> >
> > Kelly Devilliv