Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

From: Bjorn Helgaas
Date: Tue Aug 11 2015 - 15:29:06 EST


On Mon, Aug 10, 2015 at 2:07 PM, Duc Dang <dhdang@xxxxxxx> wrote:
> On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang@xxxxxxx> wrote:
>>> On Monday, August 10, 2015, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>>>>
>>>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang@xxxxxxx> wrote:
>>>> > On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>>>> >> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>>> >>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>>> >>
>>>> >>> > Do you have another PCIe card to try on the same reboot test on this
>>>> >>> > board?
>>>> >>>
>>>> >>> I've seen this on at least two Mellanox cards. I'm running similar
>>>> >>> tests on a different type of card now.
>>>> >>
>>>> >> FWIW, reboot tests on two machines with Mellanox cards failed, while
>>>> >> the same test on a machine with a different proprietary card succeeded.
>>>> >
>>>> > Thanks, Bjorn.
>>>> >
>>>> > I don't have the same Mellanox card as yours, but I will also run a
>>>> > similar reboot test to see if I hit the same issue with my card.
>>>>
>>>> Any more hints on this? Nothing has changed on my end, so of course
>>>> I'm still seeing this, always on machines with Mellanox, and never on
>>>> other machines. Could this be a hardware issue like a signal
>>>> integrity or margin issue? I don't know where to go from here because
>>>> I'm not a hardware person, and I don't know anything to do in
>>>> software.
>>>
>>>
>>> Hi Bjorn,
>>>
>>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
>>> family, one card has 2 10G interfaces, the other one has 1 port that
>>> supports InfiniBand) with U-Boot 1.15.12 and Linux 4.2-rc5 and I did not see
>>> the crash that you encountered.
>>>
>>> Did you check if your Mellanox cards have the latest firmware? I did see some
>>> link issues on my Mellanox cards with their old firmware before.
>>
>> Good idea; I'll check that, too. Also, I just learned that these
>> cards are installed with an extender card because of some space issues,
>> so we're going to test again without the extender.
>
> Hi Bjorn,
>
> Are the other cards that passed your test installed directly into the
> on-board PCIe slot?
> If yes, then this is a good data point and it will be useful to test
> the case where your Mellanox cards are directly installed into the
> on-board PCIe slot.

The cards that passed the test were installed directly, with no
extender. We removed the extender from one of the machines with the
Mellanox card and have not seen this issue since then. I think it's
very likely that the problem is related to using the extender.

Bjorn