Re: [PATCH for-rc 2/9] RDMA/hns: Fix a long wait for cmdq event during reset

From: Junxian Huang
Date: Mon Jul 08 2024 - 02:51:10 EST




On 2024/7/8 13:38, Leon Romanovsky wrote:
> On Mon, Jul 08, 2024 at 10:29:54AM +0800, Junxian Huang wrote:
>>
>>
>> On 2024/7/7 16:30, Leon Romanovsky wrote:
>>> On Fri, Jul 05, 2024 at 04:59:30PM +0800, Junxian Huang wrote:
>>>> From: wenglianfa <wenglianfa@xxxxxxxxxx>
>>>>
>>>> During reset, cmdq events won't be reported, leading to a long and
>>>> unnecessary wait. Notify all the cmdqs to stop waiting at the beginning
>>>> of reset.
>>>>
>>>> Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
>>>> Signed-off-by: wenglianfa <wenglianfa@xxxxxxxxxx>
>>>> Signed-off-by: Junxian Huang <huangjunxian6@xxxxxxxxxxxxx>
>>>> ---
>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 18 ++++++++++++++++++
>>>> 1 file changed, 18 insertions(+)
>>>>
>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> index a5d746a5cc68..ff135df1a761 100644
>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>> @@ -6977,6 +6977,21 @@ static void hns_roce_hw_v2_uninit_instance(struct hnae3_handle *handle,
>>>>
>>>> handle->rinfo.instance_state = HNS_ROCE_STATE_NON_INIT;
>>>> }
>>>> +
>>>> +static void hns_roce_v2_reset_notify_cmd(struct hns_roce_dev *hr_dev)
>>>> +{
>>>> + struct hns_roce_cmdq *hr_cmd = &hr_dev->cmd;
>>>> + int i;
>>>> +
>>>> + if (!hr_dev->cmd_mod)
>>>
>>> What prevents cmd_mod from being changed?
>>>
>>
>> It's set when the device is being initialized, and won't be changed after that.
>
> This is exactly the point, you are assuming that the device is already
> ininitialized or not initialized at all. What prevents hns_roce_v2_reset_notify_cmd()
> from being called in the middle of initialization?
>
> Thanks
>

This is ensured by hns3 NIC driver.

Initialization and reset of hns RoCE are both called by hns3. It will check the state
of RoCE device (see line 3798), and notify RoCE device to reset (hns_roce_v2_reset_notify_cmd()
is called) only if the RoCE device has been already initialized:

3791 static int hclge_notify_roce_client(struct hclge_dev *hdev,
3792 enum hnae3_reset_notify_type type)
3793 {
3794 struct hnae3_handle *handle = &hdev->vport[0].roce;
3795 struct hnae3_client *client = hdev->roce_client;
3796 int ret;
3797
3798 if (!test_bit(HCLGE_STATE_ROCE_REGISTERED, &hdev->state) || !client)
3799 return 0;
3800
3801 if (!client->ops->reset_notify)
3802 return -EOPNOTSUPP;
3803
3804 ret = client->ops->reset_notify(handle, type);
3805 if (ret)
3806 dev_err(&hdev->pdev->dev, "notify roce client failed %d(%d)",
3807 type, ret);
3808
3809 return ret;
3810 }

And the bit is set (see line 11246) after the initialization has been done (line 11242):

11224 static int hclge_init_roce_client_instance(struct hnae3_ae_dev *ae_dev,
11225 struct hclge_vport *vport)
11226 {
11227 struct hclge_dev *hdev = ae_dev->priv;
11228 struct hnae3_client *client;
11229 int rst_cnt;
11230 int ret;
11231
11232 if (!hnae3_dev_roce_supported(hdev) || !hdev->roce_client ||
11233 !hdev->nic_client)
11234 return 0;
11235
11236 client = hdev->roce_client;
11237 ret = hclge_init_roce_base_info(vport);
11238 if (ret)
11239 return ret;
11240
11241 rst_cnt = hdev->rst_stats.reset_cnt;
11242 ret = client->ops->init_instance(&vport->roce);
11243 if (ret)
11244 return ret;
11245
11246 set_bit(HCLGE_STATE_ROCE_REGISTERED, &hdev->state);
11247 if (test_bit(HCLGE_STATE_RST_HANDLING, &hdev->state) ||
11248 rst_cnt != hdev->rst_stats.reset_cnt) {
11249 ret = -EBUSY;
11250 goto init_roce_err;
11251 }

Junxian

>>
>> Junxian
>>
>>>> + return;
>>>> +
>>>> + for (i = 0; i < hr_cmd->max_cmds; i++) {
>>>> + hr_cmd->context[i].result = -EBUSY;
>>>> + complete(&hr_cmd->context[i].done);
>>>> + }
>>>> +}
>>>> +
>>>> static int hns_roce_hw_v2_reset_notify_down(struct hnae3_handle *handle)
>>>> {
>>>> struct hns_roce_dev *hr_dev;
>>>> @@ -6997,6 +7012,9 @@ static int hns_roce_hw_v2_reset_notify_down(struct hnae3_handle *handle)
>>>> hr_dev->dis_db = true;
>>>> hr_dev->state = HNS_ROCE_DEVICE_STATE_RST_DOWN;
>>>>
>>>> + /* Complete the CMDQ event in advance during the reset. */
>>>> + hns_roce_v2_reset_notify_cmd(hr_dev);
>>>> +
>>>> return 0;
>>>> }
>>>>
>>>> --
>>>> 2.33.0
>>>>
>>