Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"

From: Sven Schnelle
Date: Thu Sep 08 2011 - 02:42:20 EST


Jon Mason <mason@xxxxxxxx> writes:

> On Wed, Sep 7, 2011 at 1:58 PM, Simon Kirby <sim@xxxxxxxxxx> wrote:
>> On Wed, Sep 07, 2011 at 01:57:28PM -0700, Jon Mason wrote:
>>
>>> On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@xxxxxxxxxx> wrote:
>>> > On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:
>>> >
>>> >> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
>>> >>
>>> >> > On Wed, 7 Sep 2011 12:52:25 -0400
>>> >> > Josh Boyer <jwboyer@xxxxxxxxx> wrote:
>>> >> >
>>> >> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@xxxxxxxxxxxxxx>
>>> >> > > wrote:
>>> >> > > > Simon Kirby <sim@xxxxxxxxxx> writes:
>>> >> > > >
>>> >> > > >> Hello!
>>> >> > > >>
>>> >> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
>>> >> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
>>> >> > > >>
>>> >> > > >>    1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
>>> >> > > >>    2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
>>> >> > > >>    3 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>>> >> > > >>    4 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>>> >> > > >
>>> >> > > > I'm seeing exactly the same issue on a Dell 1950 server. If anyone
>>> >> > > > wants me to try additional debugging/patches, feel free to send
>>> >> > > > them. Unfortunately I don't have the time/knowledge to debug this
>>> >> > > > myself.
>>> >> > >
>>> >> > > I thought Jesse or Jon had a revert or partial fix queued up to send
>>> >> > > to Linus, but I don't see anything in or post -rc5 yet. That was
>>> >> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
>>> >> > >
>>> >> > > Jesse, Jon?
>>> >> >
>>> >> > kernel.org is still down and I haven't pushed anything to github. I
>>> >> > asked Jon to send his patch directly to Linus today instead.
>>> >>
>>> >> FWIW, this patch didn't seem to fix it:
>>> >> https://bugzilla.kernel.org/attachment.cgi?id=71222
>>> >>
>>> >> dmesg used to say:
>>> >>
>>> >> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
>>> >> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>>> >> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>>> >> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
>>> >> pci 0000:08:00.0: MPS configured higher than maximum supported by the device. If a bus issue occurs, try running with pci=pcie_bus_safe.
>>> >> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
>>> >> Uhhuh. NMI received for unknown reason 21 on CPU 0.
>>> >> Do you have a strange power saving mode enabled?
>>> >> Dazed and confused, but trying to continue
>>> >
>>> > Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
>>> > stopped, but this made me realize that the pci=pcie_bus_safe option must
>>> > have been missing. It turns out I had hacked a custom grub entry to load
>>> > the newest kernel into grub instead of the one with the highest version
>>> > number (grumble), so the default kopt didn't apply there.
>>> >
>>> > So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
>>> > MRRS-disabling patch makes no difference in this case.
>>> >
>>> > Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
>>> > or make it not change where it would otherwise warn, or does that
>>> > basically make the thing useless?
>>>
>>> I have a patch that makes pcie_bus_safe the default behavior
>>> and does not modify the MRRS. Would you be willing to test this patch
>>> for me?
>>
>> Sure, of course. (It compiles, ship it. :))
>
> Great, thanks! I've attached a patch file to this e-mail.

Thanks, Jon. Works on my system (Dell 1950).

Tested-by: Sven Schnelle <svens@xxxxxxxxxxxxxx>
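
For anyone who wants to check their own boot log for the pattern Simon
quoted above, here is a rough Python sketch. It only assumes the
"Dev MPS ... MPSS ... MRRS ..." line format shown in that dmesg excerpt;
the function names and the rule used for flagging (last reported MPS
larger than the MPSS on the device's first line) are my own illustration,
not anything the kernel or the patch does.

```python
import re

# Matches lines such as:
#   pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
# The format is taken from the dmesg excerpt quoted in this thread.
LINE_RE = re.compile(
    r"pci (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Dev MPS (?P<mps>\d+) MPSS (?P<mpss>\d+) MRRS (?P<mrrs>\d+)"
)


def parse_mps_lines(dmesg_text):
    """Return {device: [(mps, mpss, mrrs), ...]} in log order."""
    seen = {}
    for line in dmesg_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            seen.setdefault(m.group("dev"), []).append(
                (int(m.group("mps")), int(m.group("mpss")), int(m.group("mrrs")))
            )
    return seen


def raised_past_initial_mpss(seen):
    """Hypothetical check: devices whose last MPS exceeds their first MPSS."""
    flagged = []
    for dev, entries in seen.items():
        first_mpss = entries[0][1]
        last_mps = entries[-1][0]
        if last_mps > first_mpss:
            flagged.append(dev)
    return flagged


# The lines quoted earlier in this thread, used as sample input.
SAMPLE_DMESG = """\
pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
pci 0000:08:00.0: MPS configured higher than maximum supported by the device.
pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
"""

if __name__ == "__main__":
    print(raised_past_initial_mpss(parse_mps_lines(SAMPLE_DMESG)))
    # prints ['0000:08:00.0']
```

On the sample above it flags only 0000:08:00.0, which is the device the
kernel warned about. To run it against a live system, feed it the output
of dmesg instead of SAMPLE_DMESG.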

Regards

Sven