Hi Yishai,

Johannes has been working on an mlx4 initialization problem on an
IBM x3850 X6. The underlying problem is a PCI core issue -- we're
setting the RCB (Read Completion Boundary) bit in the Mellanox device,
which tells it that it may use a 128-byte completion boundary, even
though the Root Port above it can't handle that. That issue is
https://bugzilla.kernel.org/show_bug.cgi?id=187781
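
For reference, here is a minimal sketch of the RCB rule as I understand
it: an endpoint should only have RCB set (128 bytes) when the Root Port
above it also reports RCB set. The bit value matches PCI_EXP_LNKCTL_RCB
in pci_regs.h; the helper names are made up for illustration and this
is not the actual PCI core code or the eventual fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* Link Control register, bit 3: Read Completion Boundary.
 * 0 = 64 bytes, 1 = 128 bytes (same value as PCI_EXP_LNKCTL_RCB). */
#define LNKCTL_RCB 0x0008

/* Hypothetical helper: is the endpoint's RCB setting consistent with
 * what the Root Port reports? Only "endpoint 128, root port 64" is bad. */
static bool endpoint_rcb_is_valid(uint16_t ep_lnkctl, uint16_t rp_lnkctl)
{
	if ((ep_lnkctl & LNKCTL_RCB) && !(rp_lnkctl & LNKCTL_RCB))
		return false;
	return true;
}

/* Hypothetical helper: return the endpoint Link Control value with RCB
 * cleared when the Root Port does not report RCB; otherwise unchanged. */
static uint16_t clamp_endpoint_rcb(uint16_t ep_lnkctl, uint16_t rp_lnkctl)
{
	if (!(rp_lnkctl & LNKCTL_RCB))
		return ep_lnkctl & (uint16_t)~LNKCTL_RCB;
	return ep_lnkctl;
}
```

In these terms, the x3850 X6 case is the one where
endpoint_rcb_is_valid() would return false.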

The machine crashed when this happened, apparently not because of any
error reported via AER, but because mlx4 hit a BUG_ON, probably
the one in mlx4_enter_error_state().
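
As far as I can tell, that BUG_ON is only reachable when
pci_channel_offline() returns false, i.e., when the PCI core has not
marked the device's channel as offline. Is this telling us about a
problem in PCI error handling, or is it just a case where mlx4 isn't
as smart as it could be?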
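
To make the question concrete, here is a toy model of the control flow
as I read it. It is not the driver code: the function names and the two
boolean inputs are invented for illustration, and the assumption that
the BUG_ON fires on a failed reset attempt should be checked against
the mlx4 source.

```c
#include <stdbool.h>

/* Toy model only. channel_offline stands in for the result of
 * pci_channel_offline(); reset_fails stands in for a device reset
 * attempt that returns an error. */
static int model_reset(bool channel_offline, bool reset_fails)
{
	int err = 0;

	if (!channel_offline) {
		/* The PCI core still considers the device reachable,
		 * so a reset is attempted and may fail. */
		if (reset_fails)
			err = -1;
	}
	/* When the channel is offline, no reset is attempted and
	 * no error is returned. */
	return err;
}

/* In this model the BUG_ON corresponds to a nonzero error. */
static bool model_would_bug(bool channel_offline, bool reset_fails)
{
	return model_reset(channel_offline, reset_fails) != 0;
}
```

In this model the crash needs both conditions at once: the channel is
not marked offline, and the reset still fails.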