Re: [PATCH] RDMA/rxe: destroy the mcg when rxe_mcast_add() fails in rxe_get_mcg()
From: Zhu Yanjun
Date: Sun Jun 14 2026 - 21:28:34 EST
On 2026/6/14 6:04, Michael Bommarito wrote:
rxe_get_mcg() inserts the new mcg into rxe->mcg_tree and takes the tree
reference before calling rxe_mcast_add() outside mcg_lock. On failure
the error path frees the mcg with a bare kfree() without erasing the
tree node or dropping the tree reference, so the freed mcg stays linked
in mcg_tree and the next __rxe_lookup_mcg() on the same mgid uses it
after free. rxe_mcast_add() fails reachably from an unprivileged caller:
-ENODEV when the backing netdev is removed, or a propagated dev_mc_add()
error.
Tear the mcg down with __rxe_destroy_mcg() on the failure path, as
rxe_attach_mcast() already does.
Reproduced under KASAN on QEMU by forcing the rxe_mcast_add() failure;
the use-after-free in __rxe_lookup_mcg() is gone after this change.
Fixes: a926a903b7dc ("RDMA/rxe: Do not call dev_mc_add/del() under a spinlock")
Cc: stable@xxxxxxxxxxxxxxx # v5.18+
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@xxxxxxxxx>
---
Reproduction (v7.1-rc4, x86_64 QEMU/KVM, KASAN, Soft-RoCE):
Forcing rxe_mcast_add() to return -ENODEV, an unprivileged ATTACH_MCAST
on a UD QP leaves the freed mcg linked in mcg_tree. On the stock kernel
the next lookup reports
BUG: KASAN: slab-use-after-free in __rxe_lookup_mcg
and the subsequent rb_erase() panics. Patched, the forced failure
returns cleanly. Control: with injection disabled, re-attach and detach
of the same MGID and a two-QP join/leave are KASAN-clean on both trees.
tools/testing/selftests/rdma has no rxe_mcast coverage; harness off-list
on request.
drivers/infiniband/sw/rxe/rxe_mcast.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 5cad720..7f148d4 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -196,6 +196,8 @@ static void __rxe_init_mcg(struct rxe_dev *rxe, union ib_gid *mgid,
__rxe_insert_mcg(mcg);
}
+static void __rxe_destroy_mcg(struct rxe_mcg *mcg);
+
/**
* rxe_get_mcg - lookup or allocate a mcg
* @rxe: rxe device object
@@ -247,7 +249,13 @@ static struct rxe_mcg *rxe_get_mcg(struct rxe_dev *rxe, union ib_gid *mgid)
if (!err)
return mcg;
- kfree(mcg);
+ /* mcg was made visible in mcg_tree; unwind the insert before freeing. */
+ spin_lock_bh(&rxe->mcg_lock);
+ __rxe_destroy_mcg(mcg);
+ spin_unlock_bh(&rxe->mcg_lock);
+ kref_put(&mcg->ref_cnt, rxe_cleanup_mcg);
+ return ERR_PTR(err);
+
Thanks for fixing the UAF. While this patch resolves the single-threaded issue, it introduces a severe race condition in concurrent environments.
Because rxe_mcast_add() runs outside the mcg_lock, a concurrent thread can find this mcg in the tree and successfully attach its own QPs during this window.
If the creator thread unconditionally erases the mcg from the tree on failure, those concurrent QPs become "orphaned." Future rxe_detach_mcast() calls will fail to find the erased mcg, causing these QPs and the mcg memory to leak permanently.
Simplifying the rollback by destroying the node unconditionally, or by merging the unlock paths, is risky as well: it can easily end up calling kfree() or taking a nested lock while mcg_lock is still held, which deadlocks, or erasing the same node twice, which panics in rb_erase().
The error path must destroy the mcg conditionally. After re-acquiring rxe->mcg_lock, check whether mcg->qp_list is empty:
  - If it is empty, the mcg is safe to dismantle: call __rxe_destroy_mcg(), drop the lock, and put the final reference.
  - If it is NOT empty, concurrent threads have adopted it: do not erase the tree node; simply release the lock and drop the creator's reference.
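To illustrate the two cases above, here is a rough user-space model of the conditional rollback. It is not kernel code and has not been run against rxe: struct mcg_model, model_put(), model_destroy_locked() and model_get_mcg_fail() are hypothetical stand-ins for struct rxe_mcg, kref_put(), __rxe_destroy_mcg() and the failure path of rxe_get_mcg(), and the lock is reduced to a flag. It assumes the mcg holds two references at this point (creator and tree), as described in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct rxe_mcg. */
struct mcg_model {
	int refs;	/* models kref ref_cnt: creator ref + tree ref */
	int qp_count;	/* models mcg->qp_list; 0 means list_empty() */
	bool in_tree;	/* models membership in rxe->mcg_tree */
	bool freed;	/* set once the last reference is dropped */
};

/* Models rxe->mcg_lock as a flag; a real spinlock is out of scope here. */
static bool mcg_lock_held;

/* Models kref_put(): the object is released when refs reaches zero. */
static void model_put(struct mcg_model *mcg)
{
	assert(mcg->refs > 0);
	if (--mcg->refs == 0)
		mcg->freed = true;
}

/*
 * Models __rxe_destroy_mcg(): unlink from the tree and drop the tree's
 * reference. The caller must hold the lock. The real helper does more
 * bookkeeping, which this model omits.
 */
static void model_destroy_locked(struct mcg_model *mcg)
{
	assert(mcg_lock_held);
	assert(mcg->in_tree);	/* guards against a double erase */
	mcg->in_tree = false;
	model_put(mcg);
}

/* Models the failure path after rxe_mcast_add() returned an error. */
static void model_get_mcg_fail(struct mcg_model *mcg)
{
	mcg_lock_held = true;		/* spin_lock_bh(&rxe->mcg_lock) */
	if (mcg->qp_count == 0)
		model_destroy_locked(mcg);	/* nobody adopted it */
	/* else: leave the node in the tree for the attached QPs */
	mcg_lock_held = false;		/* spin_unlock_bh(&rxe->mcg_lock) */

	model_put(mcg);			/* creator's reference */
}
```

In this model an mcg with no attached QPs ends up unlinked and released, while an adopted one stays in the tree holding only the tree reference, so a later detach can still find it. How the real code should account for rxe->mcg_num in the adopted case is left open here.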
Please consider submitting a v2 addressing this concurrency gap.
Zhu Yanjun
err_dec:
atomic_dec(&rxe->mcg_num);
return ERR_PTR(err);
base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8