RE: [PATCH for-next v9 0/5] On-Demand Paging on SoftRoCE

From: Daisuke Matsuda (Fujitsu)
Date: Tue Dec 24 2024 - 03:54:16 EST


On Mon, Dec 23, 2024 10:55 AM Daisuke Matsuda (Fujitsu) <matsuda-daisuke@xxxxxxxxxxx> wrote:
> On Mon, Dec 23, 2024 2:25 AM Joe Klein <joe.klein812@xxxxxxxxx> wrote:
> > We have tested this patchset and had a lot of problems, even without using the ODP option in softroce. I don't know if
> > others have done similar tests. If we have to merge this patchset into upstream, is it possible to add a kernel option to
> > enable/disable this patchset?
>
> Hi Joe,
>
> Can you clarify the test and the problems you observed?
> I wonder if you tried the test with the latest tree WITHOUT my patches.
>
> As far as I know, there is something wrong with the upstream right now.
> It does not complete the rdma-core testcases, and 'segmentation fault' is observed
> in the middle of the full test run, which did not happen before October 2024.

It appears that the root cause of this issue lies in the userspace components.
My report yesterday was based on experiments conducted on Ubuntu 24.04.1 LTS (x86_64).
It seems to me that rxe is broken regardless of the kernel version
as long as the userspace components are the ones provided by Ubuntu 24.04.1 LTS.
I built and tried linux-6.11, linux-6.10, and linux-6.8, and they all failed in the way I reported.

I switched to Ubuntu 22.04.5 LTS (aarch64) to test with the older libraries.
All available tests passed on the rdma for-next tree without any problems.
I then applied my ODP patches on top of it, and everything still passes.
####################
ubuntu@rdma-aarch64:~/rdma-core$ git branch -v
* master fb965e2d0 Merge pull request #1531 from selvintxavier/pbuf_optimization
ubuntu@rdma-aarch64:~/rdma-core$ ./build/bin/run_tests.py
..........ss..........ssssssssss..............ssssssssssssssssssssssssss.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss........ssssss..ss....s.sssssss....ss....ss..............s......................ss.............sss...ssss
----------------------------------------------------------------------
Ran 321 tests in 3.599s

OK (skipped=211)
ubuntu@rdma-aarch64:~/rdma-core$ ./build/bin/run_tests.py -k odp
sssssssss..ss....s.s
----------------------------------------------------------------------
Ran 20 tests in 0.269s

OK (skipped=13)
####################
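
As a sanity check on the summary lines above, the progress string printed by
run_tests.py can be tallied directly. This is a minimal sketch, assuming the
standard Python unittest progress characters ('.' pass, 's' skip, 'F' failure,
'E' error); tally_progress is a hypothetical helper and not part of rdma-core.

```python
from collections import Counter


def tally_progress(progress: str) -> dict:
    """Count outcomes in a unittest-style progress string."""
    counts = Counter(progress)
    return {
        "passed": counts["."],
        "skipped": counts["s"],
        "failed": counts["F"],
        "errors": counts["E"],
        # Only the four known outcome characters contribute to the total.
        "total": sum(counts[c] for c in ".sFE"),
    }


# Progress string copied from the "-k odp" run above.
odp_run = "sssssssss..ss....s.s"
summary = tally_progress(odp_run)
print(summary)
```

For the ODP run this gives 20 tests in total with 13 skipped, which agrees with
the "Ran 20 tests" and "OK (skipped=13)" lines.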

Possibly, there was a regression in libibverbs between v39.0-1 and v50.0-2build2.
We need to take a closer look to resolve the malfunction of rxe on Ubuntu 24.04.
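
When reading the quoted dmesg output below, it may help to turn the numeric
codes into names. The following is a rough sketch only: the tables are small
hand-written subsets (Linux errno values, and work completion status values as
I understand them from enum ib_wc_status), and annotate is a hypothetical
helper. Some of these errors may well be triggered on purpose by negative
tests, so a decoded name is not by itself evidence of a bug.

```python
import re

# Assumption: generic Linux errno numbering (only the values seen in the logs).
LINUX_ERRNO = {9: "EBADF", 22: "EINVAL", 95: "EOPNOTSUPP"}

# Assumption: subset of enum ib_wc_status (only the values seen in the logs).
WC_STATUS = {
    4: "IB_WC_LOC_PROT_ERR",
    9: "IB_WC_REM_INV_REQ_ERR",
    10: "IB_WC_REM_ACCESS_ERR",
}

ERR_RE = re.compile(r"returned err = -(\d+)")
WC_RE = re.compile(r"non-flush error status = (\d+)")


def annotate(line: str):
    """Return a symbolic name for the code in an rxe log line, or None."""
    m = ERR_RE.search(line)
    if m:
        return LINUX_ERRNO.get(int(m.group(1)), "unknown errno")
    m = WC_RE.search(line)
    if m:
        return WC_STATUS.get(int(m.group(1)), "unknown wc status")
    return None


# Sample lines copied from the quoted dmesg output.
samples = [
    "rxe_ens3: qp#21 make_send_cqe: non-flush error status = 4",
    "rxe_ens3: rxe_create_cq: returned err = -22",
    "rxe_ens3: rxe_create_cq: returned err = -95",
]
for line in samples:
    print(annotate(line), "<-", line)
```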

In conclusion, I believe there is nothing in my ODP patches that could cause
the rxe driver to fail. I would appreciate any feedback on potential improvements.

Thanks,
Daisuke

>
> Here are the details of the issue:
> ===== test log =====
> ubuntu@rdma-dev:~$ sudo rdma link add rxe_ens3 type rxe netdev ens3
> ubuntu@rdma-dev:~$ cd rdma-core
> ubuntu@rdma-dev:~/rdma-core$ uname -r
> 6.13.0-rc1+
> ubuntu@rdma-dev:~/rdma-core$ pwd
> /home/ubuntu/rdma-core
> ubuntu@rdma-dev:~/rdma-core$ ./build/bin/run_tests.py
> ..........ss.../usr/lib/python3.12/_weakrefset.py:39: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmpe7nsitov'
> mode='rb+' closefd=True>
> def _remove(item, selfref=ref(self)):
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
> /usr/lib/python3.12/_weakrefset.py:39: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmpid85cbou'
> mode='rb+' closefd=True>
> def _remove(item, selfref=ref(self)):
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
> .......ssssss/usr/lib/python3.12/contextlib.py:141: ResourceWarning: unclosed file <_io.FileIO
> name='/tmp/tmp9pgb7zo8' mode='rb+' closefd=True>
> def __exit__(self, typ, value, traceback):
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
> ssss..............ssssssssssssssssssssssssss.ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
> sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss........ssssssssssssssssss
> /usr/lib/python3.12/_weakrefset.py:39: ResourceWarning: unclosed file <_io.FileIO name='/tmp/tmpate1loci'
> mode='rb+' closefd=True>
> def _remove(item, selfref=ref(self)):
> ResourceWarning: Enable tracemalloc to get the object allocation traceback
> Traceback (most recent call last):
> File "pd.pyx", line 120, in pyverbs.pd.PD.close
> pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
> Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
> Traceback (most recent call last):
> File "pd.pyx", line 120, in pyverbs.pd.PD.close
> pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
> ssssTraceback (most recent call last):
> File "pd.pyx", line 120, in pyverbs.pd.PD.close
> pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
> Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
> Traceback (most recent call last):
> File "pd.pyx", line 120, in pyverbs.pd.PD.close
> pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
> Traceback (most recent call last):
> File "pd.pyx", line 120, in pyverbs.pd.PD.close
> pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
> Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
> Traceback (most recent call last):
> File "pd.pyx", line 120, in pyverbs.pd.PD.close
> pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
> Traceback (most recent call last):
> File "pd.pyx", line 120, in pyverbs.pd.PD.close
> pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
> Exception ignored in: 'pyverbs.pd.PD.__dealloc__'
> Traceback (most recent call last):
> File "pd.pyx", line 120, in pyverbs.pd.PD.close
> pyverbs.pyverbs_error.PyverbsRDMAError: Failed to dealloc PD. Errno: 9, Bad file descriptor
> s....ssSegmentation fault (core dumped)
> ===========
>
> =====dmesg=====
> [ 147.464243] rxe_ens3: qp#21 make_send_cqe: non-flush error status = 4
> [ 147.473843] rxe_ens3: qp#23 make_send_cqe: non-flush error status = 10
> [ 147.484540] rxe_ens3: qp#25 make_send_cqe: non-flush error status = 9
> [ 147.494541] rxe_ens3: qp#27 make_send_cqe: non-flush error status = 10
> [ 147.524080] rxe_ens3: rxe_create_cq: returned err = -22
> [ 147.574197] rxe_ens3: cq#26 rxe_resize_cq: returned err = -22
> [ 147.605719] rxe_ens3: rxe_create_cq: returned err = -95
> [ 147.606454] rxe_ens3: rxe_create_cq: returned err = -22
> [ 148.803131] rxe_ens3: qp#51 make_send_cqe: non-flush error status = 10
> [ 148.831587] rxe_ens3: qp#57 make_send_cqe: non-flush error status = 10
> [ 148.841627] rxe_ens3: qp#59 make_send_cqe: non-flush error status = 10
> [ 148.851719] rxe_ens3: qp#61 make_send_cqe: non-flush error status = 10
> [ 149.104223] python3[1702]: segfault at d0 ip 00007ff95ced16c7 sp 00007fff5e775de0 error 4 in
> libibverbs.so.1.14.56.0[e6c7,7ff95ceca000+14000] likely on CPU 2 (core 0, socket 2)
> [ 149.104235] Code: 00 00 c1 e0 04 8b bf 08 01 00 00 48 8d 53 20 48 c7 43 28 00 00 00 00 83 c0 18 c7 43 34 00 00 00 00 be
> 01 1b 18 c0 66 89 43 20 <49> 8b 80 d0 00 00 00 8b 40 10 89 43 30 31 c0 e8 05 99 ff ff 41 89
> =====
>
> If you encounter any problems that clearly come from my ODP patches, please let me know what symptoms you are
> seeing.
> I would also appreciate any help you can offer in fixing the upstream issue.
>
> Thanks,
> Daisuke