RE: [PATCH] IB/mlx4: Fix stale CM id_map entries when RTU is never received
From: Praveen Kannoju
Date: Mon Jun 08 2026 - 08:35:59 EST
Confidential - Oracle Restricted \Including External Recipients
Yes, this is a separate issue from the earlier REJ handling.
In this case, when the remote node drops the REP as a duplicate, the handshake never reaches RTU, so the sending side keeps the `id_map_entry` indefinitely and a stale mapping is left behind.
Thank you for pointing me to the UAF concern and the review link. I will evaluate the locking and lifetime handling carefully, fix the patch as needed, and resend an updated version.
> -----Original Message-----
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Wednesday, June 3, 2026 5:56 AM
> To: Praveen Kannoju <praveen.kannoju@xxxxxxxxxx>
> Cc: yishaih@xxxxxxxxxx; leon@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; Anand Khoje <anand.a.khoje@xxxxxxxxxx>;
> Manjunath Patil <manjunath.b.patil@xxxxxxxxxx>
> Subject: Re: [PATCH] IB/mlx4: Fix stale CM id_map entries when RTU is never
> received
>
> On Thu, May 07, 2026 at 03:47:55PM +0000, Praveen Kumar Kannoju wrote:
> > mlx4_ib_multiplex_cm_handler() allocates an id_map_entry for CM
> > transactions, but the entry is only released on DREQ or REJ flows.
> >
> > In the duplicate REP handling scenario, cm_dup_rep_handler() may get
> > invoked when the remote side receives a REP for which no matching
> > cm_id_priv exists. In such cases the CM handshake never reaches RTU,
> > and the sender side may never receive either DREQ or REJ cleanup events.
> >
> > As a result, the allocated id_map_entry remains indefinitely,
> > resulting in a stale mapping leak.
> >
> > Fix this by scheduling delayed cleanup immediately after allocating
> > the id_map_entry. The delayed work is cancelled once CM_RTU_ATTR_ID is
> > received, indicating that the CM handshake completed successfully.
> >
> > This ensures abandoned mappings are eventually reclaimed even when RTU
> > is never received.
> >
> > Signed-off-by: Praveen Kumar Kannoju <praveen.kannoju@xxxxxxxxxx>
> > ---
> > drivers/infiniband/hw/mlx4/cm.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
> > index 63a868a3822f..700a840d491d 100644
> > --- a/drivers/infiniband/hw/mlx4/cm.c
> > +++ b/drivers/infiniband/hw/mlx4/cm.c
> > @@ -299,6 +299,7 @@ static void schedule_delayed(struct ib_device *ibdev, struct id_map_entry *id)
> > }
> >
> > #define REJ_REASON(m) be16_to_cpu(((struct cm_generic_msg *)(m))->rej_reason)
> > +#define RTU_RECEIVE_TIMEOUT (60 * HZ)
> > int mlx4_ib_multiplex_cm_handler(struct ib_device *ibdev, int port, int slave_id,
> > struct ib_mad *mad)
> > {
> > @@ -321,6 +322,9 @@ int mlx4_ib_multiplex_cm_handler(struct ib_device *ibdev, int port, int slave_id
> > __func__, slave_id, sl_cm_id);
> > return PTR_ERR(id);
> > }
> > +
> > + schedule_delayed_work(&id->timeout, RTU_RECEIVE_TIMEOUT);
>
> So this is a distinct problem from the other one? Can you put all these mlx4
> bugs into one series?
>
> Why does this open code schedule_delayed() and remove all the locking?
>
> Sashiko even points out this might create a UAF:
>
> https://sashiko.dev/#/patchset/20260507154755.452008-1-praveen.kannoju%40oracle.com
>
> Jason