Re: [RFC PATCH net v2 2/2] net/smc: Resolve the race between SMC-R link access and clear

From: dust.li
Date: Wed Dec 29 2021 - 23:00:33 EST


On Wed, Dec 29, 2021 at 01:51:27PM +0100, Karsten Graul wrote:
>On 28/12/2021 16:13, Wen Gu wrote:
>> We encountered some crashes caused by the race between SMC-R
>> link access and link clear triggered by link group termination
>> in abnormal case, like port error.
>
>Without to dig deeper into this, there is already a refcount for links, see smc_wr_tx_link_hold().
>In smc_wr_free_link() there are waits for the refcounts to become zero.
>
>Why do you need to introduce another refcounting instead of using the existing?
>And if you have a good reason, do we still need the existing refcounting with your new
>implementation?
>
>Maybe its enough to use the existing refcounting in the other functions like smc_llc_flow_initiate()?
>
>Btw: it is interesting what kind of crashes you see, we never met them in our setup.

We are trying to using SMC + RDMA to boost application performance,
we now have a product in the cloud called ERDMA which can be used
in the virtual machine.

We are testing SMC with link down/up with short flow cases since
in the cloud environment the RDMA device may be plugged in/out
frequently, and there are many different applications, some of them
may have pretty much short flows.

>Its great to see you evaluating SMC in a cloud environment!

Thanks! We are trying to use SMC to boost performance for cloud
applications, and we hope SMC can be more generic and widely used.