Re: [PATCH RFC net-next] net/smc: transition to RDMA core CQ pooling
From: D. Wythe
Date: Thu Feb 12 2026 - 04:52:41 EST
On Wed, Feb 11, 2026 at 08:51:30PM +0800, Dust Li wrote:
> On 2026-02-02 17:48:00, D. Wythe wrote:
> >The current SMC-R implementation relies on global per-device CQs
> >and manual polling within tasklets, which introduces severe
> >scalability bottlenecks due to global lock contention and tasklet
> >scheduling overhead, resulting in poor performance as concurrency
> >increases.
> >
> >Refactor the completion handling to utilize the ib_cqe API and
> >standard RDMA core CQ pooling. This transition provides several key
> >advantages:
> >
> >1. Multi-CQ: Shift from a single shared per-device CQ to multiple
> >link-specific CQs via the CQ pool. This allows completion processing
> >to be parallelized across multiple CPU cores, effectively eliminating
> >the global CQ bottleneck.
> >
> >2. Leverage DIM: Utilizing the standard CQ pool with IB_POLL_SOFTIRQ
> >enables Dynamic Interrupt Moderation from the RDMA core, optimizing
> >interrupt frequency and reducing CPU load under high pressure.
> >
> >3. O(1) Context Retrieval: Replaces the expensive wr_id based lookup
> >logic (e.g., smc_wr_tx_find_pending_index) with direct context retrieval
> >using container_of() on the embedded ib_cqe.
> >
> >4. Code Simplification: This refactoring results in a reduction of
> >~150 lines of code. It removes redundant sequence tracking, complex lookup
> >helpers, and manual CQ management, significantly improving maintainability.
>
> Excellent !
>
> Some comments below.
>
> >
> >Performance Test: redis-benchmark with max 32 connections per QP
> >Data format: Requests Per Second (RPS), Percentage in brackets
> >represents the gain/loss compared to TCP.
> >
> >| Clients | TCP | SMC (original) | SMC (cq_pool) |
> >|---------|----------|---------------------|---------------------|
> >| c = 1 | 24449 | 31172 (+27%) | 34039 (+39%) |
> >| c = 2 | 46420 | 53216 (+14%) | 64391 (+38%) |
> >| c = 16 | 159673 | 83668 (-48%) <-- | 216947 (+36%) |
> >| c = 32 | 164956 | 97631 (-41%) <-- | 249376 (+51%) |
> >| c = 64 | 166322 | 118192 (-29%) <-- | 249488 (+50%) |
> >| c = 128 | 167700 | 121497 (-27%) <-- | 249480 (+48%) |
> >| c = 256 | 175021 | 146109 (-16%) <-- | 240384 (+37%) |
> >| c = 512 | 168987 | 101479 (-40%) <-- | 226634 (+34%) |
> >
> >The results demonstrate that this optimization effectively resolves the
> >scalability bottleneck, with RPS increasing by over 110% at c=64
> >compared to the original implementation.
> >
> >Signed-off-by: D. Wythe <alibuda@xxxxxxxxxxxxxxxxx>
> >---
> > net/smc/smc_core.c | 8 +-
> > net/smc/smc_core.h | 16 ++-
> > net/smc/smc_ib.c | 114 ++++++-------------
> > net/smc/smc_ib.h | 5 -
> > net/smc/smc_tx.c | 1 -
> > net/smc/smc_wr.c | 267 ++++++++++++++++-----------------------------
> > net/smc/smc_wr.h | 38 ++-----
> > 7 files changed, 150 insertions(+), 299 deletions(-)
> >
> >diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> >index 8aca5dc54be7..9590c8aed3dd 100644
> >--- a/net/smc/smc_core.c
> >+++ b/net/smc/smc_core.c
> >@@ -815,17 +815,11 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> > lnk->lgr = lgr;
> > smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
> > lnk->link_idx = link_idx;
> >- lnk->wr_rx_id_compl = 0;
> > smc_ibdev_cnt_inc(lnk);
> > smcr_copy_dev_info_to_link(lnk);
> > atomic_set(&lnk->conn_cnt, 0);
>
> I think this isn't unlikely: we haven't signaled the RDMA WRITE wr, so
> there shouldn't be any CQEs with wc->status == 0. If we get here,
> wc->status should always be != 0.
>
Nice catch, I mixed this up with the rwwi version.
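For the record, the O(1) retrieval described in point 3 of the commit log is
the standard embedded-cqe pattern. Below is a minimal userspace sketch of the
idea; all type and field names here (struct tx_pend, etc.) are illustrative
stand-ins, not the real definitions from <rdma/ib_verbs.h> or smc_wr.h:

```c
#include <stddef.h>

/* Minimal stand-in for the RDMA core completion entry; the real
 * definition lives in <rdma/ib_verbs.h>. */
struct ib_cqe {
	void (*done)(struct ib_cqe *cqe);
};

/* Hypothetical pending-send context with an embedded ib_cqe, mirroring
 * the pattern the patch applies to the tx pending structures. */
struct tx_pend {
	int idx;		/* slot index, replaces wr_id lookups */
	int wc_status;
	struct ib_cqe cqe;	/* embedded completion entry */
};

/* container_of() as the kernel defines it, reduced to offsetof(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static int last_completed_idx = -1;

/* Completion callback: recover the enclosing context in O(1) via
 * container_of() instead of scanning a pending array by wr_id, which
 * is what smc_wr_tx_find_pending_index used to do. */
static void tx_done(struct ib_cqe *cqe)
{
	struct tx_pend *pend = container_of(cqe, struct tx_pend, cqe);

	last_completed_idx = pend->idx;
}
```

With the cqe embedded, the completion path needs no per-link pending-index
search at all; the RDMA core hands back the ib_cqe pointer and the context
falls out of pointer arithmetic.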
>
> >+}
> >+
> >+static void smc_wr_reg_process_cqe(struct ib_cq *cq, struct ib_wc *wc)
> >- sizeof(*link->wr_tx_v2_pend));
> Why remove this memset?
>
> >- memset(link->lgr->wr_tx_buf_v2, 0,
> >- sizeof(*link->lgr->wr_tx_buf_v2));
> >+ memset(link->lgr->wr_tx_buf_v2, 0, sizeof(*link->lgr->wr_tx_buf_v2));
> > } else {
> >- link->wr_tx_pends[pnd_snd_idx].wc_status = wc->status;
> >- if (link->wr_tx_pends[pnd_snd_idx].compl_requested)
> >+ if (tx_pend->compl_requested)
> > complete(&link->wr_tx_compl[pnd_snd_idx]);
> >- memcpy(&pnd_snd, &link->wr_tx_pends[pnd_snd_idx],
> >- sizeof(pnd_snd));
> >- /* clear the full struct smc_wr_tx_pend including .priv */
> >- memset(&link->wr_tx_pends[pnd_snd_idx], 0,
> >- sizeof(link->wr_tx_pends[pnd_snd_idx]));
>
> ditto
>
Looks like I accidentally deleted this code; I will restore it in the
next version.
> >- memset(&link->wr_tx_bufs[pnd_snd_idx], 0,
> >- sizeof(link->wr_tx_bufs[pnd_snd_idx]));
> >+ memset(&link->wr_tx_bufs[tx_pend->idx], 0, sizeof(link->wr_tx_bufs[tx_pend->idx]));
> > if (!test_and_clear_bit(pnd_snd_idx, link->wr_tx_mask))
> > return;
> > }
> >
> >- if (wc->status) {
> >+ if (unlikely(wc->status)) {
> > if (link->lgr->smc_version == SMC_V2) {
> > memset(link->wr_tx_v2_pend, 0,
> > sizeof(*link->wr_tx_v2_pend));
> >@@ -128,44 +117,12 @@ static inline void smc_wr_tx_process_cqe(struct ib_wc *wc)
> > /* terminate link */
> > smcr_link_down_cond_sched(link);
> > }
> >+ smc_wr_rx_init_cqe(&link->wr_rx_ibs[i].cqe);
> >+ link->wr_rx_ibs[i].wr.wr_cqe = &link->wr_rx_ibs[i].cqe;
> >+ link->wr_rx_ibs[i].index = i;
> >+ }
>
> Can we group all those xxx_init_cqe calls into a function and move them
> out of smc_wr_alloc_link_mem()? smc_wr_alloc_link_mem() should be all
> about allocating memory.
>
> Maybe we can define a smc_wr_init_cqes() and call it in
> smc_wr_create_link() ?
>
>
Sounds reasonable, I'll implement this in the next version.
The other items, such as removing the redundant fields and renaming the
index, will also be addressed in the same update.
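For illustration, the helper could look roughly like the following. This is a
userspace sketch with stub types: struct smc_link_stub, the SMC_WR_BUF_CNT
value, and the done callback are placeholders, not the real kernel
definitions from smc_wr.h or <rdma/ib_verbs.h>:

```c
#include <stddef.h>

#define SMC_WR_BUF_CNT 16	/* illustrative; the real count is in smc_wr.h */

/* Simplified stand-ins for the RDMA core and SMC link types. */
struct ib_cqe { void (*done)(struct ib_cqe *cqe); };
struct ib_recv_wr { struct ib_cqe *wr_cqe; };

struct wr_rx_ib {
	struct ib_recv_wr wr;
	struct ib_cqe cqe;
	int index;
};

struct smc_link_stub {
	struct wr_rx_ib wr_rx_ibs[SMC_WR_BUF_CNT];
};

static void smc_wr_rx_done(struct ib_cqe *cqe)
{
	(void)cqe;		/* rx completion handling elided */
}

/* Sketch of the suggested smc_wr_init_cqes(): pure CQE wiring, kept out
 * of smc_wr_alloc_link_mem() so allocation and initialization stay
 * separate; intended to be called from smc_wr_create_link(). */
static void smc_wr_init_cqes(struct smc_link_stub *lnk)
{
	int i;

	for (i = 0; i < SMC_WR_BUF_CNT; i++) {
		lnk->wr_rx_ibs[i].cqe.done = smc_wr_rx_done;
		lnk->wr_rx_ibs[i].wr.wr_cqe = &lnk->wr_rx_ibs[i].cqe;
		lnk->wr_rx_ibs[i].index = i;
	}
}
```

The tx-side cqes would be wired the same way in the same helper, so
smc_wr_alloc_link_mem() is left doing nothing but allocations.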
Thanks,
D. Wythe
> Best regards,
> Dust