RE: [PATCH for-next v9 0/5] On-Demand Paging on SoftRoCE

From: Daisuke Matsuda (Fujitsu)
Date: Sun Dec 22 2024 - 20:55:20 EST


On Mon, Dec 23, 2024 2:25 AM Joe Klein <joe.klein812@xxxxxxxxx> wrote:
> We have tested this patchset and had a lot of problems, even without using the ODP option in softroce. I don't know if others have done similar tests. If we have to merge this patchset into upstream, is it possible to add a kernel option to enable/disable this patchset?

Hi Joe,

Could you describe the tests you ran and the problems you observed?
Did you also run the same tests on the latest tree WITHOUT my patches?

As far as I know, something is wrong with upstream itself right now.
The rdma-core test suite does not run to completion: a segmentation fault occurs
in the middle of the full test run, which did not happen before October 2024.

Here are the details of the issue:
===== test log =====
ubuntu@rdma-dev:~$ sudo rdma link add rxe_ens3 type rxe netdev ens3
ubuntu@rdma-dev:~$ cd rdma-core
ubuntu@rdma-dev:~/rdma-core$ uname -r
6.13.0-rc1+
ubuntu@rdma-dev:~/rdma-core$ pwd
/home/ubuntu/rdma-core
ubuntu@rdma-dev:~/rdma-core$ ./build/bin/run_tests.py
..........ss.../usr/lib/python3.12/_weakrefset.py:39: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmpe7nsitov' mode='rb+' closefd=True>
def _remove(item, selfref=ref(self)):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/usr/lib/python3.12/_weakrefset.py:39: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmpid85cbou' mode='rb+' closefd=True>
def _remove(item, selfref=ref(self)):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.......ssssss/usr/lib/python3.12/contextlib.py:141: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmp9pgb7zo8' mode='rb+' closefd=True>
def __exit__(self, typ, value, traceback):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
ssss..............ssssssssssssssssssssssssss.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss........ssssssssssssssssss/usr/lib/python3.12/_weakrefset.py:39: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmpate1loci' mode='rb+' closefd=True>
def _remove(item, selfref=ref(self)):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
Traceback (most recent call last):
File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
ssssTraceback (most recent call last):
File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
Traceback (most recent call last):
File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Traceback (most recent call last):
File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
Traceback (most recent call last):
File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Traceback (most recent call last):
File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
Traceback (most recent call last):
File "pd.pyx", line 120, in pyverbs.pd.PD.close
pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
s....ssSegmentation fault (core dumped)
===========

=====dmesg=====
[ 147.464243] rxe_ens3: qp#21 make_send_cqe: non-flush error status = 4
[ 147.473843] rxe_ens3: qp#23 make_send_cqe: non-flush error status = 10
[ 147.484540] rxe_ens3: qp#25 make_send_cqe: non-flush error status = 9
[ 147.494541] rxe_ens3: qp#27 make_send_cqe: non-flush error status = 10
[ 147.524080] rxe_ens3: rxe_create_cq: returned err = -22
[ 147.574197] rxe_ens3: cq#26 rxe_resize_cq: returned err = -22
[ 147.605719] rxe_ens3: rxe_create_cq: returned err = -95
[ 147.606454] rxe_ens3: rxe_create_cq: returned err = -22
[ 148.803131] rxe_ens3: qp#51 make_send_cqe: non-flush error status = 10
[ 148.831587] rxe_ens3: qp#57 make_send_cqe: non-flush error status = 10
[ 148.841627] rxe_ens3: qp#59 make_send_cqe: non-flush error status = 10
[ 148.851719] rxe_ens3: qp#61 make_send_cqe: non-flush error status = 10
[ 149.104223] python3[1702]: segfault at d0 ip 00007ff95ced16c7 sp 00007fff5e775de0 error 4 in libibverbs.so.1.14.56.0[e6c7,7ff95ceca000+14000] likely on CPU 2 (core 0, socket 2)
[ 149.104235] Code: 00 00 c1 e0 04 8b bf 08 01 00 00 48 8d 53 20 48 c7 43 28 00 00 00 00 83 c0 18 c7 43 34 00 00 00 00 be 01 1b 18 c0 66 89 43 20 <49> 8b 80 d0 00 00 00 8b 40 10 89 43 30 31 c0 e8 05 99 ff ff 41 89
=====
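
For readers who want to interpret the numeric codes in the dmesg output above, here is a minimal Python sketch. It rests on assumptions not stated in this mail: that the "non-flush error status" value is an enum ib_wc_status number (same numbering as ibv_wc_status in libibverbs), and that the "returned err" values are negated Linux errno numbers. The helper name decode() and the lookup tables are hypothetical and cover only the values that appear in the log.

```python
import re

# Assumption: numbering follows enum ib_wc_status (include/rdma/ib_verbs.h),
# which matches ibv_wc_status in libibverbs. Only values seen in the log are listed.
WC_STATUS = {
    4: "IB_WC_LOC_PROT_ERR",
    9: "IB_WC_REM_INV_REQ_ERR",
    10: "IB_WC_REM_ACCESS_ERR",
}

# Linux errno values seen in the logs, hard-coded so the result does not
# depend on the platform running this script.
ERRNO = {9: "EBADF", 22: "EINVAL", 95: "EOPNOTSUPP"}

STATUS_RE = re.compile(r"(qp#\d+) make_send_cqe: non-flush error status = (\d+)")
ERR_RE = re.compile(r"(\w+): returned err = -(\d+)")
SEGV_RE = re.compile(
    r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) .*\[[0-9a-f]+,([0-9a-f]+)\+[0-9a-f]+\]"
)


def decode(line):
    """Return a readable summary of one dmesg line, or None if unrecognized."""
    m = STATUS_RE.search(line)
    if m:
        code = int(m.group(2))
        return f"{m.group(1)}: {WC_STATUS.get(code, f'unknown status {code}')}"
    m = ERR_RE.search(line)
    if m:
        code = int(m.group(2))
        return f"{m.group(1)}: {ERRNO.get(code, f'unknown errno {code}')}"
    m = SEGV_RE.search(line)
    if m:
        fault, ip, base = (int(g, 16) for g in m.groups())
        # ip - base is the offset from the start of the mapping; it is not
        # necessarily the offset into the library file itself.
        return f"fault address {fault:#x}, ip is {ip - base:#x} past the mapping start"
    return None


samples = [
    "[ 147.464243] rxe_ens3: qp#21 make_send_cqe: non-flush error status = 4",
    "[ 147.605719] rxe_ens3: rxe_create_cq: returned err = -95",
    "[ 149.104223] python3[1702]: segfault at d0 ip 00007ff95ced16c7 sp 00007fff5e775de0 "
    "error 4 in libibverbs.so.1.14.56.0[e6c7,7ff95ceca000+14000] likely on CPU 2",
]
for s in samples:
    print(decode(s))
```

Under those assumptions, the make_send_cqe and rxe_create_cq lines may well come from negative test cases that provoke errors on purpose, so they are not necessarily failures in themselves. The segfault line is the one that matters: a fault address as small as 0xd0 is consistent with a field access through a NULL pointer inside libibverbs, although only a core dump or a debug build can confirm that.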

If you encounter any problems that clearly come from my ODP patches, please let me know what symptoms you are seeing.
I would also appreciate any help you can offer in fixing the upstream issue.

Thanks,
Daisuke