Re: [PATCH for-next] RDMA/cm: Rate limit destroy CM ID timeout error message

From: Haakon Bugge
Date: Thu Sep 25 2025 - 07:30:20 EST


Hi Jason and Jake,

> On 16 Sep 2025, at 16:36, Jacob Moroni <jmoroni@xxxxxxxxxx> wrote:
>
> Does this happen when there is a missing send completion?
>
> Asking because I remember triggering this if a device encounters an
> unrecoverable
> error/VF reset while under heavy RDMA-CM activity (like a large scale
> MPI wire-up).
>
> I assumed it was because RDMA-CM was waiting for TX completions that
> would never arrive.
>
> Of course, the unrecoverable error/VF reset without generating flush
> completions was the real
> bug in my case.

I concur. I looked at the log ahead of the first incident, but didn't see any unusual mlx5 driver messages. Looking in between the incidents, however, I saw:

kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout detected
kernel: cm_destroy_id_wait_timeout: cm_id=00000000564a7a31 timed out. state 2 -> 0, refcnt=2
kernel: mlx5_core 0000:13:01.1 ens4f16: TX timeout on queue: 12, SQ: 0x14f2a, CQ: 0x1739, SQ Cons: 0x0 SQ Prod: 0x3c5, usecs since last trans: 30224000
kernel: cm_destroy_id_wait_timeout: cm_id=00000000b821dcda timed out. state 2 -> 0, refcnt=1
kernel: cm_destroy_id_wait_timeout: cm_id=00000000edf170fa timed out. state 2 -> 0, refcnt=1
kernel: mlx5_core 0000:13:01.1 ens4f16: EQ 0x14: Cons = 0x444670, irqn = 0x28c

These messages are not in close proximity in time, but a six-digit number of messages was suppressed due to the flooding.

My take is that the timeouts should be monotonically increasing from the driver to RDMA_CM (and on to the ULPs). They are not: mlx5e_build_nic_netdev() sets the netdev's watchdog_timeo to 15 seconds, whereas the timeout after which cm_destroy_id_wait_timeout() is called is 10 seconds.

So the mitigation that relies on the netdev detecting a TX timeout has not yet kicked in when cm_destroy_id_wait_timeout() is called.
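
To illustrate the ordering argument, here is a minimal userspace sketch, not kernel code. The struct, the function names and the layer labels are hypothetical; only the 15 second and 10 second values come from the observation above. It assumes "monotonically increasing" means each upper layer's timeout must be strictly larger than that of the layer below it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one layer's timeout; not a kernel structure. */
struct layer_timeout {
	const char *name;
	unsigned int seconds;
};

/*
 * Layers are listed bottom-up (driver first, ULP last).
 * Return the index of the first layer whose timeout is not strictly
 * larger than that of the layer below it, or -1 if the ordering holds.
 */
static int first_ordering_violation(const struct layer_timeout *layers,
				    size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (layers[i].seconds <= layers[i - 1].seconds)
			return (int)i;
	}
	return -1;
}

/* Values from this message: 15 s netdev watchdog, 10 s CM destroy wait. */
static const struct layer_timeout observed[] = {
	{ "netdev watchdog_timeo (mlx5e)", 15 },
	{ "RDMA CM destroy-ID wait",       10 },
};

/* Sanity check: the CM layer (index 1) is the one breaking the ordering. */
static void check_observed(void)
{
	assert(first_ordering_violation(observed, 2) == 1);
}
```

With the observed values the check flags index 1, i.e. the CM wait expires before the netdev watchdog has had a chance to fire. Any CM wait longer than 15 seconds would pass this particular check; which value is actually appropriate is a separate question.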

Thxs, Håkon
