There is no need to ping anyone, the patch is registered in patchwork
https://patchwork.kernel.org/project/linux-rdma/patch/20221005174521.63619-1-rohit.sajan.kumar@xxxxxxxxxx/
and we will get to it.

You sent the patch during the merge window, no wonder that no one looked at it.
On Wed, Oct 05, 2022 at 10:45:20AM -0700, Rohit Nair wrote:
As PRM defines, the bytewise XOR of the EQE and the EQE index should be
0xff. Otherwise, we can assume we have a corrupt EQE. The same is
applicable to CQE as well.

I didn't find anything like this in my version of PRM.
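For reference, the rule being described reduces to something like the sketch below, written byte-at-a-time (equivalent to the 8-byte folding in the diff further down); the helper name and signature are illustrative, not taken from the patch:

static bool entry_xor_is_valid(const void *entry, size_t len, u32 cons_index)
{
        const u8 *p = entry;
        size_t i;
        /* Fold in the low 24 bits of the consumer index, byte by byte. */
        u8 acc = (cons_index & 0xff) ^
                 ((cons_index >> 8) & 0xff) ^
                 ((cons_index >> 16) & 0xff);

        /* XOR in every byte of the 64-byte CQE/EQE. */
        for (i = 0; i < len; i++)
                acc ^= p[i];

        /* A well-formed entry is expected to come out as 0xff. */
        return acc == 0xff;
}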
Adding a check to verify the EQE and CQE is valid in that aspect and if
not, dump the CQE and EQE to dmesg to be inspected.

While it is nice to see prints in dmesg, you need to explain why other
mechanisms (reporters, mlx5 events, etc.) are not enough.
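"Reporters" here refers to the devlink health infrastructure. Purely as a sketch, with a hypothetical reporter handle (mlx5 does not expose a dedicated reporter for this event today, which is part of what needs explaining), reporting would look roughly like:

        /*
         * Hypothetical: hand the corrupt entry to a devlink health reporter
         * instead of only printing it to dmesg. "reporter" stands in for
         * whichever mlx5 health reporter such an event would be attached to.
         */
        devlink_health_report(reporter, "corrupt CQE/EQE detected", cqe64);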
This patch does not introduce any significant performance degradations
and has been tested using qperf.

What does it mean? You made changes in the kernel verbs flow; they are not
executed through qperf.

We also conducted several extensive performance tests using our test
suite, which utilizes rds-stress, and saw no significant performance
degradations in those results either.

Suggested-by: Michael Guralnik <michaelgur@xxxxxxxxxx>

mlx5_err ... and not dev_err ...

Will update dev_err to mlx5_err.
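Presumably that means the driver's own logging helpers rather than raw dev_err(). As a sketch, assuming mlx5_ib_err()/mlx5_core_err() are what "mlx5_err" refers to, and with illustrative variable names:

        /* cq.c: dev is the struct mlx5_ib_dev * already available here. */
        mlx5_ib_err(dev, "Faulty CQE - checksum failure: cqn=0x%x cqe_bytewise_xor=0x%x\n",
                    cq->mcq.cqn, cqe_bytewise_xor);

        /* eq.c: mdev being the struct mlx5_core_dev * at hand in that file. */
        mlx5_core_err(mdev, "Faulty EQE - checksum failure: eqn=0x%x\n", eq->eqn);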
Signed-off-by: Rohit Nair <rohit.sajan.kumar@xxxxxxxxxx>
---
drivers/infiniband/hw/mlx5/cq.c | 40 ++++++++++++++++++++++++++++
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 39 +++++++++++++++++++++++++++
2 files changed, 79 insertions(+)
diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index be189e0..2a6d722 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -441,6 +441,44 @@ static void mlx5_ib_poll_sw_comp(struct mlx5_ib_cq *cq, int num_entries,
}
}
+static void verify_cqe(struct mlx5_cqe64 *cqe64, struct mlx5_ib_cq *cq)
+{
+ int i = 0;
+ u64 temp_xor = 0;
+ struct mlx5_ib_dev *dev = to_mdev(cq->ibcq.device);
+
+ u32 cons_index = cq->mcq.cons_index;
+ u64 *eight_byte_raw_cqe = (u64 *)cqe64;
+ u8 *temp_bytewise_xor = (u8 *)(&temp_xor);
+ u8 cqe_bytewise_xor = (cons_index & 0xff) ^
+ ((cons_index & 0xff00) >> 8) ^
+ ((cons_index & 0xff0000) >> 16);
+
+ for (i = 0; i < sizeof(struct mlx5_cqe64); i += 8) {
+ temp_xor ^= *eight_byte_raw_cqe;
+ eight_byte_raw_cqe++;
+ }
+
+ for (i = 0; i < (sizeof(u64)); i++) {
+ cqe_bytewise_xor ^= *temp_bytewise_xor;
+ temp_bytewise_xor++;
+ }
+
+ if (cqe_bytewise_xor == 0xff)
+ return;
+
+ dev_err(&dev->mdev->pdev->dev,
+ "Faulty CQE - checksum failure: cqe=0x%x cqn=0x%x cqe_bytewise_xor=0x%x\n",
+ cq->ibcq.cqe, cq->mcq.cqn, cqe_bytewise_xor);
+ dev_err(&dev->mdev->pdev->dev,
+ "cons_index=%u arm_sn=%u irqn=%u cqe_size=0x%x\n",
+ cq->mcq.cons_index, cq->mcq.arm_sn, cq->mcq.irqn, cq->mcq.cqe_sz);
+
+ print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET,
+ 16, 1, cqe64, sizeof(*cqe64), false);
+ BUG();

No BUG() in new code.
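Something along these lines instead of the BUG() would keep the diagnostics without taking the machine down (just a sketch of the direction, not a tested replacement):

        /* Dump the suspect CQE, warn once, and let normal error handling continue. */
        print_hex_dump(KERN_WARNING, "", DUMP_PREFIX_OFFSET, 16, 1,
                       cqe64, sizeof(*cqe64), false);
        WARN_ON_ONCE(1);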