iSCSI deadlock when logging in to and out of 2 iSCSI nodes in a tight loop on an unstable network

From: wangqiang (AY)
Date: Mon Jun 19 2017 - 09:37:02 EST


Hi all,

With the upstream 4.12.0-rc5 kernel, I hit a deadlock while logging in to and out of 2 iSCSI nodes in a tight loop on an unstable network: iscsid ends up stuck in the D (uninterruptible sleep) state.

SysRq-w reported the following call stacks:
[10660.667133] sysrq: SysRq : Show Blocked State
[10660.671536]   task                        PC stack   pid father
[10660.672137] iscsid          D    0 11415      1 0x00000000
[10660.672139] Call Trace:
[10660.672141]  __schedule+0x1d9/0x780
[10660.672144]  schedule+0x36/0x80
[10660.672146]  async_synchronize_cookie_domain+0xaa/0x150
[10660.672149]  ? remove_wait_queue+0x60/0x60
[10660.672151]  async_synchronize_full_domain+0x18/0x20
[10660.672153]  sd_remove+0x4d/0xc0 [sd_mod]
[10660.672156]  device_release_driver_internal+0x141/0x200
[10660.672158]  device_release_driver+0x12/0x20
[10660.672159]  bus_remove_device+0xfd/0x170
[10660.672161]  device_del+0x1e8/0x330
[10660.672164]  __scsi_remove_device+0xc6/0xd0
[10660.672166]  scsi_remove_device+0x26/0x40
[10660.672168]  scsi_remove_target+0x165/0x1b0
[10660.672174]  __iscsi_unbind_session+0xbe/0x170 [scsi_transport_iscsi]
[10660.672180]  iscsi_remove_session+0xd6/0x190 [scsi_transport_iscsi]
[10660.672186]  iscsi_destroy_session+0x12/0x50 [scsi_transport_iscsi]
[10660.672192]  iscsi_session_teardown+0xd5/0xf0 [libiscsi]
[10660.672195]  iscsi_sw_tcp_session_destroy+0x48/0x70 [iscsi_tcp]
[10660.672202]  iscsi_if_recv_msg+0xbb1/0x1390 [scsi_transport_iscsi]
[10660.672205]  ? __kmalloc_node_track_caller+0x199/0x280
[10660.672211]  iscsi_if_rx+0x8e/0x1f0 [scsi_transport_iscsi]
[10660.672214]  netlink_unicast+0x181/0x240
[10660.672216]  netlink_sendmsg+0x2c2/0x3b0
[10660.672220]  sock_sendmsg+0x38/0x50
[10660.672222]  ___sys_sendmsg+0x27a/0x290
[10660.672224]  ? kernfs_refresh_inode+0xc8/0xe0
[10660.672227]  ? _copy_to_user+0x26/0x40
[10660.672230]  __sys_sendmsg+0x54/0x90
[10660.672232]  SyS_sendmsg+0x12/0x20
[10660.672234]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[10660.672325] kworker/u48:2   D    0 22364      2 0x00000000
[10660.672332] Workqueue: scsi_wq_7 __iscsi_unbind_session [scsi_transport_iscsi]
[10660.672333] Call Trace:
[10660.672336]  __schedule+0x1d9/0x780
[10660.672350]  ? set_next_entity+0x7c/0x7a0
[10660.672353]  schedule+0x36/0x80
[10660.672354]  schedule_timeout+0x249/0x300
[10660.672356]  ? finish_task_switch+0xec/0x240
[10660.672359]  wait_for_completion+0x121/0x180
[10660.672361]  ? wake_up_q+0x80/0x80
[10660.672364]  flush_workqueue+0x16d/0x490
[10660.672366]  ? probe_sched_wakeup+0x1c/0x20
[10660.672369]  drain_workqueue+0x5b/0x130
[10660.672371]  destroy_workqueue+0x1a/0x200
[10660.672373]  scsi_host_dev_release+0x6f/0x130
[10660.672375]  device_release+0x32/0x90
[10660.672377]  kobject_cleanup+0x63/0x170
[10660.672379]  kobject_put+0x28/0x50
[10660.672381]  put_device+0x17/0x20
[10660.672383]  scsi_host_put+0x15/0x20
[10660.672389]  iscsi_session_release+0x3d/0x80 [scsi_transport_iscsi]
[10660.672390]  device_release+0x32/0x90
[10660.672393]  kobject_cleanup+0x63/0x170
[10660.672395]  kobject_put+0x28/0x50
[10660.672396]  put_device+0x17/0x20
[10660.672398]  scsi_target_dev_release+0x1f/0x30
[10660.672400]  device_release+0x32/0x90
[10660.672402]  kobject_cleanup+0x63/0x170
[10660.672404]  kobject_put+0x28/0x50
[10660.672406]  put_device+0x17/0x20
[10660.672407]  scsi_target_destroy+0xc5/0xe0
[10660.672409]  scsi_target_reap_ref_release+0x35/0x40
[10660.672411]  scsi_target_reap+0x2e/0x40
[10660.672413]  scsi_remove_target+0x197/0x1b0
[10660.672419]  __iscsi_unbind_session+0xbe/0x170 [scsi_transport_iscsi]
[10660.672421]  process_one_work+0x149/0x360
[10660.672423]  worker_thread+0x4d/0x3c0
[10660.672425]  kthread+0x109/0x140
[10660.672427]  ? rescuer_thread+0x380/0x380
[10660.672428]  ? kthread_park+0x60/0x60
[10660.672430]  ret_from_fork+0x25/0x30

sd_remove() calls async_synchronize_full_domain() (which ends up in async_synchronize_cookie_domain(), as seen in the trace) to wait for the async work scheduled by sd_probe(), i.e. sd_probe_async(). Because of the unstable network, the session being probed by sd_probe_async() has gone into ISCSI_STATE_FAILED, so the I/O issued by sd_probe_async() never completes. iscsi_conn_failure() sends an event to iscsid and waits for a response to stop the connection, but iscsid cannot respond because it is blocked in sd_remove() waiting for sd_probe_async() to complete. Hence the deadlock.
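For reference, here is a minimal sketch of the async probe/remove pattern that creates this dependency. It is not the actual drivers/scsi/sd.c code; the example_* names are made up for illustration, only the <linux/async.h> calls are the real API:

#include <linux/async.h>
#include <linux/device.h>

/* plays the role of scsi_sd_probe_domain */
static ASYNC_DOMAIN(example_probe_domain);

static void example_probe_async(void *data, async_cookie_t cookie)
{
	/* Issues SCSI commands to the new LUN (read capacity, cache mode,
	 * ...).  If the iSCSI session has entered ISCSI_STATE_FAILED,
	 * this I/O cannot complete until the connection is recovered or
	 * torn down. */
}

static int example_probe(struct device *dev)
{
	/* Like sd_probe(): return quickly and defer the slow I/O. */
	async_schedule_domain(example_probe_async, dev, &example_probe_domain);
	return 0;
}

static int example_remove(struct device *dev)
{
	/* Like sd_remove(): wait for the deferred probe work before
	 * freeing the disk.  In this report the wait can never finish,
	 * because completing the probe I/O needs iscsid to handle the
	 * connection failure, and iscsid is the very process blocked
	 * here. */
	async_synchronize_full_domain(&example_probe_domain);
	return 0;
}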

So, is there any way to fix this deadlock?
Any reply would be appreciated.

Running the shell script below on the initiator side for 5 ~ 20 minutes reproduces the deadlock (each node needs at least 10 LUNs).
#!/bin/bash
# discover the 2 target nodes
iscsiadm -m discovery -p IP_A -t st
iscsiadm -m discovery -p IP_B -t st
# set network loss rate (eth1 connected to the ipsan)
tc qdisc add dev eth1 root netem loss 60%
# loop test
n=1
while true
do
     echo "$(date) ==== test loop $n ===="
     iscsiadm -m node -T iqn.xxxx::yyyy:IP_A --login
     sleep 5
     iscsiadm -m node -T iqn.xxxx::yyyy:IP_A --logout&
     iscsiadm -m node -T iqn.xxxx::yyyz:IP_B --login&
     sleep 10
     iscsiadm -m node -u
     n=$(($n + 1))
     echo "$(date) ==== continue to test loop ===="
done