Re: modprobe mlx5_core on OCI bare-metal instance causes unrecoverable hang and I/O error

From: Saeed Mahameed
Date: Fri Feb 07 2025 - 14:33:41 EST


On 07 Feb 13:24, Mitchell Augustin wrote:
> *facepalm*
>
> Thanks, I can't believe that wasn't my first thought as soon as I
> learned these instances were using iSCSI. That's almost certainly what
> is happening on this OCI instance, since the host adapter for its
> iSCSI transport is a ConnectX card.
>
> The fact that I was able to see similar behavior once on a machine
> booted from a local disk (in the A100 test I mentioned) is still
> confusing though. I'll update this thread if I can figure out a
> reliable way to reproduce that behavior.


BTW, I have seen this happen in a few virtualized environments as well,
where the VM's storage is network/RDMA backed by the host driver/network.
When the host driver restarts (which is normal behavior for a PV setup),
the VM also gets storage-related timeouts and soft lockups. Graceful
shutdown needs to be handled inside the network-backed block devices,
IMHO.

-Saeed.