Re: [PATCH net-next v3] net/smc: transition to RDMA core CQ pooling
From: D. Wythe
Date: Sat Mar 07 2026 - 05:08:35 EST
On Fri, Mar 06, 2026 at 05:37:49PM +0530, Mahanta Jambigi wrote:
>
>
> On 05/03/26 7:53 am, D. Wythe wrote:
> > The current SMC-R implementation relies on global per-device CQs
> > and manual polling within tasklets, which introduces severe
> > scalability bottlenecks due to global lock contention and tasklet
> > scheduling overhead, resulting in poor performance as concurrency
> > increases.
> >
> > Refactor the completion handling to utilize the ib_cqe API and
> > standard RDMA core CQ pooling. This transition provides several key
> > advantages:
> >
> > 1. Multi-CQ: Shift from a single shared per-device CQ to multiple
> > link-specific CQs via the CQ pool. This allows completion processing
> > to be parallelized across multiple CPU cores, effectively eliminating
> > the global CQ bottleneck.
> >
> > 2. Leverage DIM: Utilizing the standard CQ pool with IB_POLL_SOFTIRQ
> > enables Dynamic Interrupt Moderation from the RDMA core, optimizing
> > interrupt frequency and reducing CPU load under high pressure.
> >
> > 3. O(1) Context Retrieval: Replaces the expensive wr_id based lookup
> > logic (e.g., smc_wr_tx_find_pending_index) with direct context retrieval
> > using container_of() on the embedded ib_cqe.
> >
> > 4. Code Simplification: This refactoring results in a reduction of
> > ~150 lines of code. It removes redundant sequence tracking, complex lookup
> > helpers, and manual CQ management, significantly improving maintainability.
> >
> > Performance Test: redis-benchmark with max 32 connections per QP
> > Data format: Requests Per Second (RPS), Percentage in brackets
> > represents the gain/loss compared to TCP.
> >
> > | Clients | TCP | SMC (original) | SMC (cq_pool) |
> > |---------|----------|---------------------|---------------------|
> > | c = 1 | 24449 | 31172 (+27%) | 34039 (+39%) |
> > | c = 2 | 46420 | 53216 (+14%) | 64391 (+38%) |
> > | c = 16 | 159673 | 83668 (-48%) <-- | 216947 (+36%) |
> > | c = 32 | 164956 | 97631 (-41%) <-- | 249376 (+51%) |
> > | c = 64 | 166322 | 118192 (-29%) <-- | 249488 (+50%) |
> > | c = 128 | 167700 | 121497 (-27%) <-- | 249480 (+48%) |
> > | c = 256 | 175021 | 146109 (-16%) <-- | 240384 (+37%) |
> > | c = 512 | 168987 | 101479 (-40%) <-- | 226634 (+34%) |
> >
> > The results demonstrate that this optimization effectively resolves the
> > scalability bottleneck, with RPS increasing by over 110% at c=64
> > compared to the original implementation.
>
> Since your performance results look very good on x86 but ours show
> severe degradations on s390x, one way forward could be to add the
> cq_pool mechanism while also keeping the existing mechanism for now
> (because, as things stand, the existing one works better on s390x),
> and to make it either runtime or compile-time configurable which of
> the two is used.
>
> Alternatively, we could work together to ensure the cq_pool mechanism
> does not introduce a regression on s390x (and ideally improves
> performance on s390x as well). But in that case we would like to have
> this change deferred until we find a way to make the regression
> disappear.
>
> I am aware that the first option, co-existence, would kill the
> simplification aspect of this change and instead add complexity.
> But we are talking about a major regression on one end and major
> improvements on the other, so it might still be worth it. In any
> case, we are very motivated to eventually get rid of the old
> mechanism, provided significant performance regressions can be
> avoided.
I'm in no rush to push this. Since a significant performance
degradation was observed on s390x, I'll withdraw this patch until the
issue is resolved. It would be great if you could investigate what
specifically happened on s390x.
D. Wythe