Hi Andrea & Robert,
We have hit this four times today. Any ideas?
[ 169.382113] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 169.382152] IP: [<ffffffffa051e968>] isert_login_recv_done+0x28/0x170 [ib_isert]
So, we spent more time tracking down this bug.
It seems that an error occurs at login time. We are not sure exactly
what kind of error, but isert_connect_error() is called and
isert_conn->cm_id is set to NULL.
Later, isert_login_recv_done() tries to access
isert_conn->cm_id->device and we get the NULL pointer dereference.
Below is the patch that we applied to track down this problem.
And this is what we see in dmesg with this patch applied:
[ 658.633188] isert: isert_connect_error: conn ffff887f2209c000 error
[ 658.633226] isert: isert_login_recv_done: login with broken rdma_cm_id
As we can see, isert_connect_error() is called before
isert_login_recv_done(), and at that point isert_conn->cm_id is NULL.
Obviously, simply checking whether the pointer is NULL, returning, and
ignoring the error in isert_login_recv_done() is not the best fix, and
I'm not sure whether I'm breaking something else by doing so (although
with this patch the kernel doesn't crash and I've not seen any problems
so far).
Maybe a better way is to tear down the whole connection when this
particular case happens? Suggestions?
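For readers following along, the ordering described above can be modeled in user space roughly as follows. This is only a sketch, not the debugging patch itself: the model_* structs and functions are hypothetical stand-ins for isert_conn, rdma_cm_id and the two kernel functions, and the guard returns an error code instead of silently ignoring the condition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct model_device { const char *name; };
struct model_cm_id { struct model_device *device; };
struct model_conn { struct model_cm_id *cm_id; };

/* Models isert_connect_error(): the connection loses its cm_id. */
static void model_connect_error(struct model_conn *conn)
{
	assert(conn != NULL);
	conn->cm_id = NULL;
}

/*
 * Models a guarded isert_login_recv_done(): returns 0 when the login
 * completion can be processed, -1 when the cm_id is already gone.
 */
static int model_login_recv_done(struct model_conn *conn)
{
	assert(conn != NULL);
	if (conn->cm_id == NULL) {
		fprintf(stderr, "login with broken rdma_cm_id\n");
		return -1;
	}
	/* Safe: cm_id was checked just above. */
	printf("login recv on %s\n", conn->cm_id->device->name);
	return 0;
}
```

Calling model_connect_error() and then model_login_recv_done() on the same connection takes the early-return path instead of dereferencing NULL, which is the behavior the dmesg lines above suggest. Whether returning early is enough in the real driver is exactly the open question.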
So I assume that isert_cma_handler() -> isert_connect_error() getting
called to clear isert_conn->cm_id before the connection is established,
with isert_conn->login_req_buf->rx_cqe.done() ->
isert_login_recv_done() still being invoked after the connection
failure, is new RDMA API behavior.
@@ -1452,7 +1452,7 @@
 isert_login_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
 	struct isert_conn *isert_conn = wc->qp->qp_context;
-	struct ib_device *ib_dev = isert_conn->cm_id->device;
+	struct ib_device *ib_dev = isert_conn->device->ib_device;
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		isert_print_wc(wc, "login recv");
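The idea behind the one-line change can be illustrated with a small user-space sketch. It assumes, as the diff implies but this thread does not confirm, that isert_conn->device stays valid after isert_connect_error() has cleared cm_id. All sketch_* names are hypothetical and only mirror the shape of the kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; field names mirror the ones used in the diff. */
struct sketch_ib_device { int id; };
struct sketch_isert_device { struct sketch_ib_device *ib_device; };
struct sketch_cm_id { struct sketch_ib_device *device; };

struct sketch_conn {
	struct sketch_cm_id *cm_id;         /* cleared on connect error */
	struct sketch_isert_device *device; /* assumed to stay set */
};

/*
 * Looks up the ib_device through conn->device rather than conn->cm_id,
 * so the lookup does not depend on cm_id being non-NULL.
 */
static struct sketch_ib_device *
sketch_get_ib_dev(const struct sketch_conn *conn)
{
	assert(conn != NULL && conn->device != NULL);
	return conn->device->ib_device;
}
```

With this lookup the handler can still reach the device and continue on to the wc->status check even when cm_id has already been cleared.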