RE: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues

From: matsuda-daisuke@xxxxxxxxxxx
Date: Mon Sep 12 2022 - 04:31:34 EST


On Sat, Sep 10, 2022 4:39 AM Bob Pearson wrote:
> On 9/6/22 21:43, Daisuke Matsuda wrote:
> > In order to implement On-Demand Paging on the rxe driver, the three
> > tasklets (requester, responder, and completer) must be allowed to sleep
> > so that they can trigger page faults when the pages being accessed are
> > not present.
> >
> > This patch replaces the tasklets with workqueues, but still allows the
> > work functions to be called directly from softirq context when it is
> > obvious that no MRs will be accessed and no work is pending in the
> > workqueue.
> >
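
For reference, here is a minimal sketch of that dispatch decision. The struct
layout and all names below are illustrative assumptions of mine, not
necessarily what the patch actually uses:

#include <linux/atomic.h>
#include <linux/workqueue.h>

/* Illustrative wrapper; the patch's actual layout may differ. */
struct rxe_task {
        struct workqueue_struct *wq;
        struct work_struct      work;
        bool                    no_mr_access;   /* handler will not touch MRs */
        atomic_t                suspended;      /* nonzero during qp reset (see below) */
        void                    (*func)(void *arg);
        void                    *arg;
};

static void rxe_sched_task(struct rxe_task *task)
{
        /* Run inline from softirq context only when no page fault can
         * occur (no MR access) and no work is already queued that a
         * direct call would overtake.
         */
        if (task->no_mr_access && !work_pending(&task->work))
                task->func(task->arg);
        else
                queue_work(task->wq, &task->work);      /* handler may sleep */
}
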
> > Since workqueues have no counterparts to tasklet_disable() and
> > tasklet_enable(), an atomic value is introduced to keep the work items
> > suspended while a qp reset is in progress.
> >
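
And a hedged sketch of how such an atomic suspend/resume pair can stand in
for tasklet_disable()/tasklet_enable(), reusing the illustrative rxe_task
struct above (again, names are mine, not the patch's):

/* Suspend: bump the counter, then wait out any handler in flight. */
static void rxe_suspend_task(struct rxe_task *task)
{
        atomic_inc(&task->suspended);
        flush_workqueue(task->wq);
}

static void rxe_resume_task(struct rxe_task *task)
{
        atomic_dec(&task->suspended);
}

/* Work function: skip the run while a qp reset is in progress. */
static void rxe_do_work(struct work_struct *work)
{
        struct rxe_task *task = container_of(work, struct rxe_task, work);

        if (atomic_read(&task->suspended))
                return;

        task->func(task->arg);
}
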
> > As a reference, the performance change was measured using the ib_send_bw
> > and ib_send_lat commands over an RC connection. Both the client and the
> > server were placed on the same KVM host. The option "-n 100000" was given
> > to each command to run 100000 iterations.
> >
> > Before applying this patch:
> > [ib_send_bw]
> > ---------------------------------------------------------------------------------------
> > #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
> > 65536 100000 0.00 203.13 0.003250
> > ---------------------------------------------------------------------------------------
> > [ib_send_lat]
> > ---------------------------------------------------------------------------------------
> > #bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
> > 2 100000 30.91 1519.05 34.23 36.06 5.15 48.49 82.37
> > ---------------------------------------------------------------------------------------
> >
> > After applying this patch:
> > [ib_send_bw]
> > ---------------------------------------------------------------------------------------
> > #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
> > 65536 100000 0.00 240.88 0.003854
> > ---------------------------------------------------------------------------------------
> > [ib_send_lat]
> > ---------------------------------------------------------------------------------------
> > #bytes #iterations t_min[usec] t_max[usec] t_typical[usec] t_avg[usec] t_stdev[usec] 99% percentile[usec] 99.9% percentile[usec]
> > 2 100000 40.88 2994.82 47.80 50.25 13.70 76.42 185.04
> > ---------------------------------------------------------------------------------------
> >
> > The test was conducted three times for each kernel, and the runs with the
> > median "BW average" and "t_typical" are shown above. The conversion
> > improves bandwidth while increasing latency.
> >
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@xxxxxxxxxxx>
> > ---
> > drivers/infiniband/sw/rxe/Makefile | 2 +-
> > drivers/infiniband/sw/rxe/rxe_comp.c | 42 ++++---
> > drivers/infiniband/sw/rxe/rxe_loc.h | 2 +-
> > drivers/infiniband/sw/rxe/rxe_net.c | 4 +-
> > drivers/infiniband/sw/rxe/rxe_param.h | 2 +-
> > drivers/infiniband/sw/rxe/rxe_qp.c | 68 +++++------
> > drivers/infiniband/sw/rxe/rxe_recv.c | 2 +-
> > drivers/infiniband/sw/rxe/rxe_req.c | 14 +--
> > drivers/infiniband/sw/rxe/rxe_resp.c | 16 +--
> > drivers/infiniband/sw/rxe/rxe_task.c | 152 ------------------------
> > drivers/infiniband/sw/rxe/rxe_task.h | 69 -----------
> > drivers/infiniband/sw/rxe/rxe_verbs.c | 8 +-
> > drivers/infiniband/sw/rxe/rxe_verbs.h | 8 +-
> > drivers/infiniband/sw/rxe/rxe_wq.c | 161 ++++++++++++++++++++++++++
> > drivers/infiniband/sw/rxe/rxe_wq.h | 71 ++++++++++++
> > 15 files changed, 322 insertions(+), 299 deletions(-)
> > delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
> > delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
> > create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
> > create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h

<...>

>
> Daisuke,
>
> We (HPE) have an internal patch set that also allows using workqueues instead of tasklets.
> There is a certain amount of work to allow control over CPU assignment from the ULP or
> application level. This is critical for high performance in multithreaded IO applications. It
> would be in both of our interests if we could find a common implementation that works for
> everyone.

I agree with you, and I am interested in your work.

Would you post the patch set? It may be possible to rebase my work onto your implementation.
I simply replaced the tasklets mechanically and modified the conditions for dispatching work
items to the responder and completer workqueues, so the changes are not very complicated;
a rough sketch of that conversion follows below. I'll be away from 9/15 to 9/20. After that,
I can take the time to look into it.
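
For reference, the mechanical part of the replacement looks roughly like the
following; the callback and workqueue names here are illustrative, not the
exact ones in the patch:

        /* before: handler runs in softirq context and must not sleep */
        tasklet_setup(&qp->req.task, rxe_requester_tasklet);
        tasklet_schedule(&qp->req.task);

        /* after: handler runs in process context and may sleep,
         * e.g. to resolve an ODP page fault
         */
        INIT_WORK(&qp->req.work, rxe_requester_work);
        queue_work(rxe_wq, &qp->req.work);

with rxe_wq coming from something like alloc_workqueue("rxe_wq", 0, 0) at
driver init.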

By the way, I would also like you to join the discussion about whether to preserve the tasklets or not.
Cf. https://lore.kernel.org/all/TYCPR01MB8455D739FC9FB034E3485C87E5449@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

>
> Perhaps we could arrange for a call to discuss?

This is an important change that may affect all rxe users,
so I think the discussion should be open to everybody, at least at this stage.

Thanks,
Daisuke