Re: [PATCH] target: Fix a double put in transport_free_session

From: Mike Christie
Date: Fri Mar 26 2021 - 12:25:08 EST


On 3/26/21 7:31 AM, Maurizio Lombardi wrote:
>
>
> On 23. 03. 21 at 3:58, Lv Yunlong wrote:
>> In transport_free_session, se_nacl is got from se_sess
>> with the initial reference. If se_nacl->acl_sess_list is
>> empty, se_nacl->dynamic_stop is set to true. Then the first
>> target_put_nacl(se_nacl) will drop the initial reference
>> and free se_nacl. Later there is a second target_put_nacl()
>> to put se_nacl. It may cause error in race.
>>
>> My patch sets se_nacl->dynamic_stop to false to avoid the
>> double put.
>>
>> Signed-off-by: Lv Yunlong <lyl2019@xxxxxxxxxxxxxxxx>
>> ---
>> drivers/target/target_core_transport.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
>> index 5ecb9f18a53d..c266defe694f 100644
>> --- a/drivers/target/target_core_transport.c
>> +++ b/drivers/target/target_core_transport.c
>> @@ -584,8 +584,10 @@ void transport_free_session(struct se_session *se_sess)
>> }
>> mutex_unlock(&se_tpg->acl_node_mutex);
>>
>> - if (se_nacl->dynamic_stop)
>> + if (se_nacl->dynamic_stop) {
>> target_put_nacl(se_nacl);
>> + se_nacl->dynamic_stop = false;
>> + }
>>
>> target_put_nacl(se_nacl);
>> }
>>
>
> FYI,
>
> I have received a bug report against the 5.8 kernel about task hangs that seems to involve the nacl "dynamic_stop" code
>
> Mar 4 16:49:44 gzboot kernel: [186322.177819] INFO: task targetcli:2359053 blocked for more than 120 seconds.
> Mar 4 16:49:44 gzboot kernel: [186322.178862] Tainted: P O 5.8.0-44-generic #50-Ubuntu
> Mar 4 16:49:44 gzboot kernel: [186322.179746] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar 4 16:49:44 gzboot kernel: [186322.180583] targetcli D 0 2359053 2359052 0x00000000
> Mar 4 16:49:44 gzboot kernel: [186322.180586] Call Trace:
> Mar 4 16:49:44 gzboot kernel: [186322.180592] __schedule+0x212/0x5d0
> Mar 4 16:49:44 gzboot kernel: [186322.180595] ? usleep_range+0x90/0x90
> Mar 4 16:49:44 gzboot kernel: [186322.180596] schedule+0x55/0xc0
> Mar 4 16:49:44 gzboot kernel: [186322.180597] schedule_timeout+0x10f/0x160
> Mar 4 16:49:44 gzboot kernel: [186322.180601] ? evict+0x14c/0x1b0
> Mar 4 16:49:44 gzboot kernel: [186322.180602] __wait_for_common+0xa8/0x150
> Mar 4 16:49:44 gzboot kernel: [186322.180603] wait_for_completion+0x24/0x30
> Mar 4 16:49:44 gzboot kernel: [186322.180637] core_tpg_del_initiator_node_acl+0x8e/0x120 [target_core_mod]

We would need more details. That hang can occur normally when the target
driver waits for stuck IO to complete before calling
transport_deregister_session. If you had hit the put bug, I think we would
have seen the refcount warning in refcount.h fire before the hung task
message, because we would be doing an extra put.

Did the user switch to ACLs? Did you see my comment that there appears to
be a bug if the user adds an ACL while a dynamic ACL is in use for the same
initiatorname? In that case, we do not do the put to match the reference
taken in core_tpg_check_initiator_node_acl, so we would hit your hang
because no one ever does the last put.
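
To make the two failure modes concrete, here is a rough userspace sketch.
It is a toy model, not the target core code: struct toy_nacl and the
toy_* helpers are made up for illustration, the counts are arbitrary, and
a plain int stands in for the kref. It only shows why an extra put would
be expected to show up as a refcount warning, while a missing put leaves
the waiter stuck.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a refcounted node ACL; not the kernel's se_node_acl. */
struct toy_nacl {
	int refs;       /* models the kref */
	bool released;  /* models the completion signalled on final release */
	int warnings;   /* models a refcount underflow warning */
};

static void toy_init(struct toy_nacl *n)
{
	n->refs = 1;            /* initial reference */
	n->released = false;
	n->warnings = 0;
}

static void toy_get(struct toy_nacl *n)
{
	n->refs++;
}

static void toy_put(struct toy_nacl *n)
{
	if (n->refs == 0) {
		/* Nothing left to drop: this is the "extra put" case. */
		n->warnings++;
		fprintf(stderr, "toy_nacl: refcount underflow\n");
		return;
	}
	if (--n->refs == 0)
		n->released = true; /* a waiter would be woken here */
}

/* Every reference is dropped, then one more put is issued. */
static struct toy_nacl toy_extra_put(void)
{
	struct toy_nacl n;

	toy_init(&n);
	toy_get(&n);    /* refs: 2 */
	toy_put(&n);    /* refs: 1 */
	toy_put(&n);    /* refs: 0, released */
	toy_put(&n);    /* extra put, warning counted */
	return n;
}

/* A reference is taken but its matching put never happens. */
static struct toy_nacl toy_missing_put(void)
{
	struct toy_nacl n;

	toy_init(&n);
	toy_get(&n);    /* refs: 2 */
	toy_put(&n);    /* refs: 1, the other put is skipped */
	return n;
}
```

In the model, toy_extra_put() ends with released set and one warning,
while toy_missing_put() ends with one reference outstanding and released
never set, which is the state a wait_for_completion() caller would block
on forever.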