Re: modprobe mlx5_core on OCI bare-metal instance causes unrecoverable hang and I/O error
From: Saeed Mahameed
Date: Fri Feb 07 2025 - 14:33:41 EST
On 07 Feb 13:24, Mitchell Augustin wrote:
> *facepalm*
> Thanks, I can't believe that wasn't my first thought as soon as I
> learned these instances were using iSCSI. That's almost certainly what
> is happening on this OCI instance, since the host adapter for its
> iSCSI transport is a ConnectX card.
> The fact that I was able to see similar behavior once on a machine
> booted from a local disk (in the A100 test I mentioned) is still
> confusing though. I'll update this thread if I can figure out a
> reliable way to reproduce that behavior.
BTW I saw this happening in a few instances of virtualized environments as
well, where the VM storage is network/RDMA backed by the host driver/network.
When the host driver restarts (which is normal behavior for a PV setup),
the VM also gets storage-related timeouts and soft lockups. Graceful
shutdown needs to be handled inside of the network-backed block devices
IMHO.
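For anyone who wants a quick sanity check before reloading the driver on such a box, here is a rough userspace sketch. It is not something proposed in this thread and not a fix for the underlying problem. It assumes the usual sysfs layout (/sys/class/net/<iface>/device/driver and /sys/class/iscsi_session), and the helper names are made up for illustration. The heuristic is deliberately coarse: it only reports that some interface is bound to the driver while some iSCSI session exists, and does not prove that the session actually runs over that interface.

```python
import os

def interfaces_bound_to(driver, sysfs_root="/sys"):
    """Return network interfaces whose underlying device is bound to `driver`."""
    net_dir = os.path.join(sysfs_root, "class", "net")
    if not os.path.isdir(net_dir):
        return []
    found = []
    for iface in sorted(os.listdir(net_dir)):
        # Assumption: the bound driver is exposed as a symlink at device/driver.
        link = os.path.join(net_dir, iface, "device", "driver")
        if os.path.islink(link) and os.path.basename(os.readlink(link)) == driver:
            found.append(iface)
    return found

def active_iscsi_sessions(sysfs_root="/sys"):
    """Return the names of iSCSI sessions listed in sysfs, if any."""
    sess_dir = os.path.join(sysfs_root, "class", "iscsi_session")
    if not os.path.isdir(sess_dir):
        return []
    return sorted(os.listdir(sess_dir))

def unload_looks_risky(driver="mlx5_core", sysfs_root="/sys"):
    """Coarse check: the driver backs a NIC and at least one iSCSI session exists."""
    return bool(interfaces_bound_to(driver, sysfs_root)) and bool(
        active_iscsi_sessions(sysfs_root)
    )

if __name__ == "__main__":
    if unload_looks_risky():
        print("mlx5_core backs a NIC and iSCSI sessions are active; "
              "unloading it may cut off storage")
    else:
        print("no obvious iSCSI dependency on mlx5_core found")
```

A negative result does not mean the unload is safe, since other network-backed storage (NVMe-oF, NFS root, and so on) is not covered by this check.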
-Saeed.