Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message
From: Haakon Bugge
Date: Thu Sep 25 2025 - 07:30:20 EST
Hi Jason and Jake,
> On 16 Sep 2025, at 16:36, Jacob Moroni <jmoroni@xxxxxxxxxx> wrote:
>
> Does this happen when there is a missing send completion?
>
> Asking because I remember triggering this if a device encounters an
> unrecoverable error/VF reset while under heavy RDMA-CM activity (like a
> large scale MPI wire-up).
>
> I assumed it was because RDMA-CM was waiting for TX completions that
> would never arrive.
>
> Of course, the unrecoverable error/VF reset without generating flush
> completions was the real bug in my case.
I concur. I looked ahead of the first incident, but didn't see any obscure mlx5 driver messages. Looking in between, however, I saw:
kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout detected
kernel: cm_destroy_id_wait_timeout: cm_id=00000000564a7a31 timed out. state 2 -> 0, refcnt=2
kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout on queue: 12, SQ: 0x14f2a, CQ: 0x1739, SQ Cons: 0x0 SQ Prod: 0x3c5, usecs since last trans: 30224000
kernel: cm_destroy_id_wait_timeout: cm_id=00000000b821dcda timed out. state 2 -> 0, refcnt=1
kernel: cm_destroy_id_wait_timeout: cm_id=00000000edf170fa timed out. state 2 -> 0, refcnt=1
kernel: mlx5_core 0000:13:01.1 ens4f16: EQ 0x14: Cons = 0x444670, irqn = 0x28c
These are not in close proximity in time, but a six-digit number of messages was suppressed due to the flooding.
My take is that the timeouts should be monotonically increasing from the driver to RDMA_CM (and on to the ULPs). They are not, as the mlx5e_build_nic_netdev() function sets the netdev's watchdog_timeo to 15 seconds, whereas the timeout that leads to cm_destroy_id_wait_timeout() being called is 10 seconds.
So, the mitigation of detecting a TX timeout from the netdev has not yet kicked in when cm_destroy_id_wait_timeout() is called.
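To illustrate the ordering I have in mind, here is a small userspace sketch. It is not kernel code: the macro and function names are made up for illustration, and only the two values (15 s and 10 s) come from the observations above. I have assumed a strict ordering, i.e. an upper layer should wait longer than the layer below it, not merely as long.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; the values are the ones quoted in this mail. */
#define NETDEV_WATCHDOG_TIMEO_S 15 /* watchdog_timeo set by mlx5e_build_nic_netdev() */
#define CM_DESTROY_ID_TIMEOUT_S 10 /* wait before cm_destroy_id_wait_timeout() is called */

/*
 * Timeouts are listed bottom-up (driver first, then RDMA_CM, then ULPs).
 * Returns true only if every layer waits strictly longer than the layer
 * below it, so that the lower layer's recovery can run first.
 */
static bool timeouts_increase_upward(const unsigned int *secs, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (secs[i] <= secs[i - 1])
			return false;
	}
	return true;
}

/* The ordering observed here: netdev watchdog below, RDMA_CM above. */
static bool observed_ordering_ok(void)
{
	const unsigned int layers[] = {
		NETDEV_WATCHDOG_TIMEO_S,
		CM_DESTROY_ID_TIMEOUT_S,
	};

	return timeouts_increase_upward(layers, sizeof(layers) / sizeof(layers[0]));
}
```

With the values above, observed_ordering_ok() returns false, since the 10 second wait in RDMA_CM expires before the 15 second netdev watchdog can fire.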
Thxs, Håkon