RE: [PATCH rdma-next 1/1] RDMA/mana_ib: process QP error events

From: Konstantin Taranov
Date: Thu Jun 06 2024 - 04:52:20 EST


> > +
> > + switch (event->type) {
> > + case GDMA_EQE_RNIC_QP_FATAL:
> > + qpn = event->details[0];
> > + xa_lock_irqsave(&mdev->qp_table_rq, flag);
> > + qp = xa_load(&mdev->qp_table_rq, qpn);
> > + if (qp)
> > + refcount_inc(&qp->refcount);
>
>
> Move this to after checking for "if (!qp) break".

Then we can have a race condition.
Imagine that EQE came when we tried to destroy QP and the got the following execution order:
EQE | QP destroy
1 xa_lock |
2 qp = xa_load |
3 xa_unlock |
4 | xa_erase_irq
5 | refcount_dec
6 | complete
7 refcount_inc | <-------- BUG

>
> > + xa_unlock_irqrestore(&mdev->qp_table_rq, flag);
> > + if (!qp)
> > + break;
> > + if (qp->ibqp.event_handler) {
> > + ev.device = qp->ibqp.device;
> > + ev.element.qp = &qp->ibqp;
> > + ev.event = IB_EVENT_QP_FATAL;
> > + qp->ibqp.event_handler(&ev, qp->ibqp.qp_context);
> > + }
> > + if (refcount_dec_and_test(&qp->refcount))
> > + complete(&qp->free);
> > + break;

> Strange logic. Why not do:
> if (!refcount_dec_and_test(&qp->refcount))
> wait_for_completion(&qp->free);
>

It might work, but the logic will be even stranger and it will prevent some debugging.
With the proposed change, qp->free may not be completed even though the counter is 0.
As a result, the change makes an incorrect state to be an expected state, thereby making bugs with that side effect undetectable.
E.g., we have a bug "use after free" and then we try to trace whether qp was in use.
Plus, it is a good practice deinit everything that was inited. With the proposed change it is violated.